Abstract



There has been an explosion of interest in statistical models for analyzing network data, and considerable 
interest in the class of exponential random graph (ERG) models, especially in connection with difficulties in 
computing maximum likelihood estimates. The issues associated with these difficulties relate to the broader 
structure of discrete exponential families. This paper re-examines the issues in two parts. First we consider 
the closure of k-dimensional exponential families of distributions with discrete base measure and polyhedral
convex support P. We show that the normal fan of P is a geometric object that plays a fundamental role 
in deriving the statistical and geometric properties of the corresponding extended exponential families. 
We discuss its relevance to maximum likelihood estimation, both from a theoretical and computational 
standpoint. Second, we apply our results to the analysis of ERG models. In particular, by means of a 
detailed example, we provide some characterization of the properties of ERG models, and, in particular, of 
certain behaviors of ERG models known as degeneracy. 

1 Introduction 

Our motivation for the work described in this paper comes from the analysis of network data using models 
representable by graphs, where the nodes correspond to individuals and the edges to relations or linkages 
among them. Such graphical representation has a long history, dating back to Moreno (1934), and was 
recast within the exponential family framework by Holland and Leinhardt (1981) and Frank and Strauss 
(1986) (see also Strauss and Ikeda, 1990). Their work led to the development of the broader class of 
exponential random graph (ERG), or p*, models for social networks (see, e.g. Wasserman and Pattison, 
1996), but likelihood methods for their analysis remained out of reach until earlier this decade. For a 
broad review of these and other network models, see Goldenberg et al. (2009). Recent work on maximum
likelihood estimation for ERG models, however, has pointed to difficulties that have been characterized as 
"degeneracies" or "near degeneracies" by Handcock (2003) and Hunter et al. (2008). The explanation for 
these difficulties lies within broader characterizations of "degeneracies" for discrete exponential families. 

Exponential families are one of the most important and widespread classes of parametric statistical models,
whose remarkable properties have long been established in the statistical literature (see, e.g., Barndorff-Nielsen,
1978; Brown, 1986; Letac, 1992). Among the most interesting features of exponential families
is the notion of the closure of the family, known as the extended exponential family, whose mathematical 
theory has been recently worked out in great generality (see Csiszar and Matus, 2001, 2003, 2005, 2008). 
The study of the extended families is particularly important, as it may directly pertain to the existence of 
the maximum likelihood estimates and to the estimability of the natural parameters. This is the case for 
discrete exponential families, for which the maximum likelihood estimates may not exist with some positive 
probability. A notable instance is the class of log-linear models, for which existence of the MLE and closure 
of the family can be characterized in a purely geometric fashion (see, e.g., Eriksson et al., 2006; Geiger et 
al., 2006; Rinaldo, 2006a). 

In this article we are concerned with discrete linear exponential families. In the first part of the paper, 
we show that the geometric and statistical properties of the extended family depend in a fundamental way 
on the normal fan of the convex support. In particular, the normal fan can be used to characterize non- 
identifiability of the families in the closure, to represent the densities in the extended family as almost sure 
limits of the densities in the original family along certain directions of the parameter space and to describe 
the directions of recession of the (negative) log-likelihood function. 

As an application of our results, in the second part of the paper we turn our attention to exponential 
random graph models, a particular class of discrete linear exponential families. Our discussion is based on
the detailed analysis of the ERG model on the graphs on 9 nodes with two-dimensional sufficient statistics
consisting of the number of edges and the number of triangles. We use Shannon's entropy function to 
illustrate graphically how concentrated the distributions in this family are, viewed as functions of both the 
natural and mean value parameters. Besides illustrating the theoretical results derived in the first part of 
the article, our analysis sheds light on a variety of pathological behaviors observed in practice while fitting 
ERG models known as degeneracy (see, e.g., Handcock, 2003), and, more generally, on the qualities and 
attributes of ERG models. Our analyses indicate that perhaps network analysts and methodologists attribute 
to ERG models a degree of regularity that they may not possess. 

The remainder of this article is organized as follows. In Section 2 we provide the derivation of our
key theoretical results. In Section 2.1, we begin by describing our settings and briefly review the theory of
extended exponential families and their fundamental properties. Then, in Section 2.2, we introduce the notions
of normal cones and the normal fan of the convex support of the family. In Section 2.3 we state our main
result and discuss its corollaries, while Section 2.4 presents some computational considerations
concerning maximum likelihood estimation for extended exponential families. Section 3 consists of an 
application of our results to ERG models. First in Section 3.1 we introduce the class of ERG models and 
then in Section 3.2 we present our running example of an ERG model on the set of all graphs on 9 nodes. 
We next introduce the concept of degeneracy for ERG models in Section 3.3, while in Section 3.4 we use our 
theoretical results to illustrate graphically the features of the model in the running example of Section 3.2 
to show how degeneracy arises. The appendices contain the proofs and some additional results on how to
establish existence of the maximum likelihood estimates in discrete linear exponential families using linear 
programming. 

We end this section by establishing the notation we will be using throughout. For two vectors x and y in
$\mathbb{R}^k$, $\langle x, y \rangle = \sum_{i=1}^{k} x_i y_i$ denotes their inner product. The Euclidean norm of a
vector x is $\|x\|_2 = \sqrt{\langle x, x \rangle}$. If A is a subset of $\mathbb{R}^k$, we indicate with convhull(A)
its convex hull and with cone(A) the set of all of its conic combinations. Finally, for any $A \subset \mathbb{R}^k$,
possibly of dimension less than k, its relative interior ri(A) is its interior relative to convhull(A).

2 Extended Exponential Families with Polyhedral Support



2.1 Settings 

In this section we introduce the statistical and geometric background needed for our results. We will assume 
throughout some familiarity with the general theory of exponential families and the basics of polyhedral 
geometry. For more complete treatments, consult Barndorff-Nielsen (1978), Brown (1986), Csiszar and Matus
(2001, 2003, 2005, 2008) and Rinaldo (2006a) for material on exponential families, and Ziegler (1996) and 
Schrijver (1998) for introductions to polyhedral geometry. 

We consider an exponential family of distributions $\mathcal{E}_P$ on $\mathbb{R}^k$ with densities

$$p_\theta(x) = \exp\{\langle x, \theta \rangle - \psi(\theta)\}, \quad \theta \in \Theta,$$

with respect to some base measure $\nu$, where

$$\Theta = \left\{ \theta \in \mathbb{R}^k : \int_{\mathbb{R}^k} \exp\{\langle x, \theta \rangle\} \, d\nu(x) < \infty \right\}$$

is the natural parameter space and $\psi(\theta) = \log \int_{\mathbb{R}^k} \exp\{\langle x, \theta \rangle\} \, d\nu(x)$ the
log-partition function. The support of $\mathcal{E}_P$ is the closure of the set $\{x : \nu(x) > 0\}$, while the convex
support P is the closure of the convex hull of the support of $\mathcal{E}_P$. We will assume throughout the paper that

(A1) $\nu$ has countable support;

(A2) P is a full-dimensional polyhedron in $\mathbb{R}^k$, that is, P does not belong to any proper affine subspace of
$\mathbb{R}^k$;

(A3) for each face F of P, $F = \mathrm{convhull}(S_F)$, for some set $S_F \subset \mathrm{supp}(\nu)$;

(A4) the natural parameter space $\Theta$ is an open set.

Assumptions (A1) and (A2) imply, in particular, that the family is in minimal form and, therefore, identifiable.
We remark that assumption (A2) is not necessary and is imposed to simplify the exposition; our results
would still hold with some minor changes without assumption (A2), at the cost of additional technicalities
in the proofs. In fact, any degenerate exponential family can be made full by taking appropriate affine
transformations, a procedure known as reduction to minimality (see, e.g., Theorem 1.9 in Brown, 1986 or
Lemma 8.1 in Barndorff-Nielsen, 1978). Assumption (A3) is needed to guarantee the existence of probability
distributions supported over the boundary of P, which is an indispensable feature of the extended exponential
family, described in the next section. It could be easily relaxed by allowing some faces to have zero
$\nu$-measure. Finally, assumption (A4) is standard. In particular, for our discussion of ERG models,
$\Theta = \mathbb{R}^k$.

2.1.1 Basics of Extended Exponential Families

Letting X = x be the observed sample from an unknown distribution in $\mathcal{E}_P$, the random set

$$\hat{\theta}(x) = \left\{ \theta^* \in \Theta : p_{\theta^*}(x) = \sup_{\theta \in \Theta} p_\theta(x) \right\} \qquad (1)$$

is the maximum likelihood estimate, or MLE, of $\theta$. If $\hat{\theta}(x) = \emptyset$, the MLE is said to be nonexistent.
Existence of the MLE is determined by the geometry of P, as indicated by the following well-known, fundamental result
(see, e.g., Theorem 5.5 in Brown, 1986 or Proposition 4.2 in Rinaldo, 2006a for different proofs).

Theorem 2.1. Under the current settings, the MLE $\hat{\theta}$ exists and is unique if and only if
$x \in \mathrm{relint}(P)$.

Furthermore, setting $E_\theta(X) = \int_{\mathbb{R}^k} z \, p_\theta(z) \, d\nu(z)$, because of the minimality of
$\mathcal{E}_P$, the mean value parametrization map

$$\nabla\psi : \mathrm{int}(\Theta) \to \mathrm{relint}(P)$$

given by

$$\nabla\psi(\theta) = E_\theta(X), \qquad (2)$$

is a homeomorphism, so that one can equivalently represent any distribution in $\mathcal{E}_P$ using the natural
parameter $\theta$ or the mean value parameter $\mu = E_\theta(X) \in \mathrm{relint}(P)$. In particular, if the MLE
exists, it is determined by the equation

$$x = \nabla\psi(\hat{\theta}),$$

which translates into the moment equation $E_{\hat{\theta}}(X) = x$.
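
To make the moment equation concrete, the following minimal sketch solves $E_\theta(X) = x$ numerically for a
one-dimensional family that we introduce purely for illustration: support {0, ..., n} with base measure
$\nu(t) = \binom{n}{t}$, i.e. the binomial family in natural form. The function names and the bisection bounds are
our own assumptions and are not part of the theory above.

```python
import math

def log_partition(theta, n):
    """psi(theta) = log sum_t C(n, t) exp(t * theta), i.e. n * log(1 + e^theta)."""
    return math.log(sum(math.comb(n, t) * math.exp(t * theta) for t in range(n + 1)))

def mean_value(theta, n):
    """E_theta[X], the derivative of psi, computed directly from the densities."""
    psi = log_partition(theta, n)
    return sum(t * math.comb(n, t) * math.exp(t * theta - psi) for t in range(n + 1))

def solve_moment_equation(x, n, lo=-30.0, hi=30.0, tol=1e-10):
    """Bisection on the increasing map theta -> E_theta[X]; needs 0 < x < n."""
    if not 0 < x < n:
        raise ValueError("x is not in the interior of the convex support [0, n]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_value(mid, n) < x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

theta_hat = solve_moment_equation(3, 10)
print(theta_hat, math.log(3 / 7))  # numerical solution next to the closed form
```

For x = 3 and n = 10 the numerical solution agrees with the closed form log(x / (n - x)). For x = 0 or x = n the
function refuses to search, which mirrors Theorem 2.1: the observed value is then on the boundary of the convex
support and no finite natural parameter solves the moment equation.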

For any proper face F, let $\nu_F$ be the restriction of $\nu$ to F. Then, $\nu_F$ determines a new exponential family
of distributions $\mathcal{E}_F$, with densities with respect to $\nu_F$ given by

$$p^F_\theta(x) = \exp\{\langle x, \theta \rangle - \psi_F(\theta)\}, \quad \theta \in \Theta_F,$$

where the natural parameter space is
$\Theta_F = \{\theta \in \Theta : \int_{\mathbb{R}^k} \exp\{\langle x, \theta \rangle\} \, d\nu_F(x) < \infty\}$ and the
log-partition function is $\psi_F(\theta) = \log \int_{\mathbb{R}^k} \exp\{\langle x, \theta \rangle\} \, d\nu_F(x)$.
Notice that, since
$\int_{\mathbb{R}^k} \exp\{\langle x, \theta \rangle\} \, d\nu_F(x) \le \int_{\mathbb{R}^k} \exp\{\langle x, \theta \rangle\} \, d\nu(x)$,
$\Theta = \Theta_F$. By assumption (A3), the convex support of this new family is F and the existence result of
Theorem 2.1 carries over: the MLE exists if and only if the observed sample x belongs to relint(F). However, since
$\mathcal{E}_F$ is supported on a lower-dimensional affine subspace of $\mathbb{R}^k$, it is no longer minimal, hence
the MLE is not unique, and it consists instead of many solutions to (1); see Corollary 2.9 below for details.
Nonetheless, via reduction to minimality (see, e.g., Brown, 1986, Theorem 1.9), it can be verified that, when
$\hat{\theta}$ is not empty, it consists exactly of those points satisfying the first order optimality conditions

$$x = \nabla\psi_F(\theta), \quad \forall \theta \in \hat{\theta}, \qquad (3)$$

with the corresponding moment equations
$E_\theta(X) = \int_{\mathbb{R}^k} z \, p^F_\theta(z) \, d\nu_F(z) = x$, $\forall \theta \in \hat{\theta}$, still holding.
In fact, lack of minimality bears no effect on the mean value parametrization: for every $\theta \in \Theta_F$, there
exists one point $x \in \mathrm{ri}(F)$ such that

$$E_\theta[X] = x, \qquad (4)$$

and, similarly, for any $x \in \mathrm{ri}(F)$, there exists a set $\theta_F \subset \Theta_F$, depending on x, such
that (4) holds for all $\theta \in \theta_F$. See equation (10) below for a characterization of $\theta_F$.
The collection of distributions

$$\mathcal{E} = \bigcup_F \mathcal{E}_F,$$

as F ranges over all the faces of P, including P itself, is called the extended exponential family of distributions.
With respect to the extended family $\mathcal{E}$, for any observed sample X = x, the MLE, or extended MLE, is always
well defined and is the set of solutions to (3), where F is the unique face containing x in its relative interior.



2.2 Extended Exponential Families and The Normal Fan of P 

In this section we introduce the notion of normal fan of P and establish its relevance for the extended family
$\mathcal{E}$. See Lemma 7.2 in Appendix B for some basic properties of the normal cones and of the normal fan.
By assumptions (A1) and (A2), there exist an $m \times k$ matrix A and a vector $b \in \mathbb{R}^m$ such that

$$P = \{x \in \mathbb{R}^k : Ax \le b\}, \qquad (5)$$

where the system contains no implicit equalities. A proper face of P is a subset of P defined by

$$F = \{x \in P : A_F x = b_F\}, \qquad (6)$$
for some subsystem $A_F x \le b_F$ of $Ax \le b$ and, therefore, it is itself a polyhedron. The whole polyhedron P is
regarded as the improper face of itself associated to the full system of inequalities, so that P is representable
as the disjoint union of the relative interiors of all its faces. The dimension of a face F, dim(F), is the
dimension of the affine subspace it generates or, equivalently, the dimension of the null space of $A_F$. Faces
of dimension k - 1 are called facets of P and, if the system (5) has no redundant inequality, something
which can always be assumed without loss of generality, the number m of rows of A matches the number of
facets. Equation (5) is known as the $\mathcal{H}$ representation of P. Alternatively, P could be described using the
$\mathcal{V}$ representation as the sum of a polytope and a polyhedral cone:

$$P = Q + C, \qquad (7)$$

where the sign + denotes Minkowski addition, and $Q = \mathrm{convhull}(\mathcal{Q})$ and $C = \mathrm{cone}(\mathcal{C})$,
with $\mathcal{Q}$ and $\mathcal{C}$ two finite sets of vectors in $\mathbb{R}^k$. Throughout the paper, we will rely on
the $\mathcal{H}$ representation (5), which we find more suited to our purposes, although our results could be
established using (7).
For every face F of P, let

$$N_F = \left\{ c \in \mathbb{R}^k : F \subseteq \{x \in P : \langle c, x \rangle = \max_{y \in P} \langle c, y \rangle\} \right\}$$

be the polyhedral cone consisting of all the linear functionals on P that are maximal over F, called the
normal cone of F. Then, $\dim(N_F) = k - \dim(F)$, so that larger faces of P correspond to smaller normal
cones. By Lemma 7.2 part 5., the normal cone of a proper face F can be equivalently defined as

$$N_F = \mathrm{cone}(a_1, \ldots, a_{m_F}),$$

where $a_i$ denotes the transpose of the i-th row of the submatrix $A_F$ given in (6), for $i = 1, \ldots, m_F$.
The collection of cones

$$\mathcal{N}(P) = \{N_F, \ F \text{ is a face of } P\}$$

forms a polyhedral complex in $\mathbb{R}^k$ (see, e.g., Sturmfels, 1995), called the normal fan of P. Notice that, since
$\dim(P) = k$, $N_P = \{0\}$ and $\mathcal{N}(P)$ is pointed. Furthermore,

$$\biguplus_F \mathrm{int}(N_F) = C^*,$$

where $C^* = \{x \in \mathbb{R}^k : \langle x, y \rangle \le 0, \ \forall y \in C\}$ is the polar of C in the $\mathcal{V}$
representation (7) of P and $\biguplus$ denotes disjoint union. In particular, if C = {0}, i.e. if P is a
full-dimensional polytope, the cones in $\mathcal{N}(P)$ partition $\mathbb{R}^k$:

$$\biguplus_F \mathrm{int}(N_F) = \mathbb{R}^k. \qquad (8)$$

We mention that, more generally, if assumption (A2) is not in force, then $N_P$ is a linear subspace of $\mathbb{R}^k$ of
dimension $k - \dim(P)$.
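
The definition of the normal cone can be explored numerically. The sketch below assumes a polytope given by a
hypothetical list of vertices (the square $[0, 2]^2$) and returns the vertices of the face over which a linear
functional c is maximal; by definition, c then belongs to the normal cone of that face. The helper name
`maximizing_face` and the tolerance are ours, chosen for illustration only.

```python
def maximizing_face(c, vertices, tol=1e-9):
    """Vertices of the face F of convhull(vertices) on which <c, x> is maximal.

    By the definition of the normal cone, c then lies in N_F; F is the largest
    face with this property, so c is in the relative interior of N_F.
    """
    values = [sum(ci * vi for ci, vi in zip(c, v)) for v in vertices]
    best = max(values)
    return [v for v, val in zip(vertices, values) if best - val <= tol]

square = [(0, 0), (2, 0), (2, 2), (0, 2)]
print(maximizing_face((1, 0), square))  # the edge {x1 = 2}
print(maximizing_face((1, 1), square))  # the vertex (2, 2)
print(maximizing_face((0, 0), square))  # all of P, since N_P = {0}
```

The three calls illustrate that larger faces correspond to smaller normal cones: only c = 0 is maximal over the
whole square, a ray of directions is maximal over an edge, and a two-dimensional cone of directions is maximal at
a vertex.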

Let $\mathrm{lin}(N_F)$ denote the subspace generated by $N_F$, which is the linear subspace spanned by the vectors
$(a_1, \ldots, a_{m_F})$. The following lemma shows that, for every face F of the convex support, the parameter space
of the extended family $\mathcal{E}_F$ can be fully described using $\mathrm{lin}(N_F)$.

Lemma 2.2. The family $\mathcal{E}_F$ is non-identifiable and $\Theta_F$ is the quotient space of $\Theta$ modulo
$\mathrm{lin}(N_F)$. Furthermore, for any $\zeta \in \mathrm{lin}(N_F)$,

$$\mathrm{rank}(I_F(\theta + \zeta)) = \dim(F), \qquad (9)$$

where $I(\cdot)$ and $I_F(\cdot)$ denote the Fisher information matrices for $\mathcal{E}_P$ and $\mathcal{E}_F$,
respectively.

The previous result characterizes $\Theta_F$ as the set of equivalence classes of points in $\Theta$, where $\theta_1$
and $\theta_2$ are in the same class if and only if $\theta_1 - \theta_2 \in \mathrm{lin}(N_F)$, and the class containing
$\theta \in \Theta$ is the set

$$\theta_F = \{\theta + \zeta \in \Theta, \ \zeta \in \mathrm{lin}(N_F)\}, \qquad (10)$$

which we call the congruence class of $\theta$ modulo $\mathrm{lin}(N_F)$. Notice that if $\Theta = \mathbb{R}^k$, then
$\Theta_F$ is comprised of affine subspaces of dimension $\dim(N_F) = k - \dim(F)$ parallel to $\mathrm{lin}(N_F)$, each
identifying a single distribution. In particular, when F = P, $\mathrm{lin}(N_F) = \{0\}$, so that $\theta_F$ is an
atomic set and we recover the original, fully identifiable family $\mathcal{E}_P$.
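
The non-identifiability stated in Lemma 2.2 is easy to check numerically. The following sketch assumes, purely for
illustration, a counting base measure on the lattice points of the square $[0, 2]^2$ and takes F to be the edge
{x1 = 2}, whose normal cone is spanned by (1, 0). Shifting the natural parameter by any multiple of that vector
leaves the density on F unchanged. The function name and the numerical values are hypothetical.

```python
import math

def face_density(theta, face_points):
    """Densities p^F_theta on the support points of a face (counting base measure)."""
    weights = [math.exp(sum(t * x for t, x in zip(theta, pt))) for pt in face_points]
    total = sum(weights)
    return [w / total for w in weights]

# Support points on the right edge of the square [0, 2]^2.
edge = [(2, 0), (2, 1), (2, 2)]
theta = (0.3, -0.7)
zeta = (5.0, 0.0)  # a point of lin(N_F) for this edge
shifted = (theta[0] + zeta[0], theta[1] + zeta[1])

p = face_density(theta, edge)
q = face_density(shifted, edge)
print(p)
print(q)  # same distribution: theta and theta + zeta are in the same congruence class
```

On F the first coordinate is constant, so the factor contributed by the shift cancels in the normalization. This
is the mechanism behind the congruence classes in (10).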

2.3 Main result 

We will utilize the normal fan $\mathcal{N}(P)$ to characterize the following convergence statements:

$$p_{\theta_n} \to p^F_{\theta_F}, \ \text{a.e. } \nu, \quad \text{and} \quad \mu_n \to \mu_F \in \mathrm{relint}(F), \qquad (11)$$

where $\mu_n = E_{\theta_n}[X]$. We take note that, because of the one-to-one correspondence between natural and
mean value parameters for the families comprising $\mathcal{E}$, the two statements imply each other. Equation (11)
is of relevance as it explicitly provides various representations of the extended family $\mathcal{E}$ as the closure
of the original family $\mathcal{E}_P$ in both natural and mean value parameterization and also in terms of almost sure
limits of the densities in $\mathcal{E}_P$.

As a preliminary observation, we point out that (11) holds true only if the parameters $\theta_n$ have diverging
norms, so that $p^F_{\theta_F}$ cannot belong to $\mathcal{E}_P$. Formally,

Lemma 2.3. If (11) is verified, then $\|\theta_n\|_2 \to \infty$.

In our main result, we establish sufficient conditions under which (11) holds or fails, based on
the cones in the normal fan of P.

Theorem 2.4. Consider the settings described above and assumptions (A1)-(A4). Let $\{\theta_n\} \subset \Theta$ be a
sequence of natural parameters satisfying $\theta_n = \eta + \rho_n d_n$, where $\{\rho_n\}$ is a sequence of
non-negative scalars tending to infinity, $\eta \in \theta_F \cap \Theta$ and $\{d_n\}$ is a sequence of unit vectors.

1. If $\{d_n\} \subset R$, with R a compact subset of $\mathrm{ri}(N_F)$, then Equation (11) holds.

2. Conversely, if $\{d_n\} \subset R$, with R a compact subset of $N_F^c$, then (11) fails.

3. If $\{d_n\} \subset R$, with R a compact subset of $(\mathcal{N}(P))^c$, then

$$\|\mu_n\|_2 \to \infty, \qquad (12)$$

which, in particular, implies that (11) is not verified.

Remark

1. The assumption $\|d_n\|_2 = 1$ for all n is imposed for mathematical convenience and does not entail any
loss in generality.

2. The Theorem shows that (11) will hold or fail uniformly over compact subsets of $\mathrm{ri}(N_F)$, for all faces
F of P.

Below, we will concern ourselves with sequences $\{\theta_n\}$ of natural parameters of a certain simplified form,
as described below.

Definition 2.5. A sequence of natural parameters $\{\theta_n\} \subset \Theta$ is a $(\theta, d, \{\rho_n\})$-sequence if

$$\theta_n = \theta + \rho_n d,$$

where $\theta \in \Theta$, $d \in \mathbb{R}^k$ and $\{\rho_n\}$ is a sequence of non-negative numbers tending to infinity.

The restriction to $(\theta, d, \{\rho_n\})$-sequences is a strong enough condition to yield a full characterization of
(11), as described in the next corollary, and yet sufficiently mild to unveil some of the fundamental features
of the extended family $\mathcal{E}$. Furthermore, it will allow us to recast some of our results in the language of
convexity theory and gain some insights on the computational aspects of calculating the extended MLE.

Corollary 2.6. Let $\{\theta_n\}$ be a $(\theta, d, \{\rho_n\})$-sequence.

1. The convergence statements in (11) hold if and only if $d \in \mathrm{ri}(N_F)$.

2. If $d \notin \mathcal{N}(P)$, then (12) is verified.

In essence, Corollary 2.6 characterizes the extended family $\mathcal{E}$ as the compactification of the original family
$\mathcal{E}_P$ under both natural and mean value parametrization. For the natural parametrization, each density in
$\mathcal{E}_F$ is obtained as the pointwise limit of sequences of densities parametrized by sequences of points in
$\Theta$ along any direction in $\mathrm{ri}(N_F)$ with norms diverging to infinity. In contrast, the corresponding
sequence of mean value parameters converges gracefully to the corresponding point of finite norm on the boundary of
P. This is a striking difference between natural and mean value parametrization, which is entirely captured
by the normal fan of P. See Figures 3 and 4 below and related discussion for more details in the context of
ERG models. See also the short movies available at http://www.stat.cmu.edu/~arinaldo/ERG/ for a direct
graphical illustration of these claims.
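
The contrast between the two parametrizations can be seen in a small numerical experiment. The sketch below assumes
a toy family that is not taken from the paper: counting base measure on the nine lattice points of $[0, 2]^2$, a
fixed starting parameter, and the direction d = (1, 0), which lies in the relative interior of the normal cone of
the edge F = {x1 = 2}. All names and numerical values are illustrative.

```python
import math

SUPPORT = [(i, j) for i in range(3) for j in range(3)]  # lattice points of [0, 2]^2

def mean_value(theta):
    """Mean value parameter E_theta[X] under the counting base measure on SUPPORT."""
    logw = [theta[0] * x + theta[1] * y for x, y in SUPPORT]
    m = max(logw)  # subtract the maximum for numerical stability
    w = [math.exp(v - m) for v in logw]
    z = sum(w)
    return (sum(wi * x for wi, (x, _) in zip(w, SUPPORT)) / z,
            sum(wi * y for wi, (_, y) in zip(w, SUPPORT)) / z)

theta = (0.3, -0.7)
d = (1.0, 0.0)  # in ri(N_F) for the edge F = {x1 = 2}
for rho in (0.0, 5.0, 20.0, 80.0):
    theta_n = (theta[0] + rho * d[0], theta[1] + rho * d[1])
    # the norm of theta_n diverges while the mean value parameter settles on F
    print(rho, math.hypot(*theta_n), mean_value(theta_n))
```

As rho grows, the norm of the natural parameter diverges, while the mean value parameter approaches a point with
first coordinate 2 and second coordinate strictly between 0 and 2, that is, a point in the relative interior of F.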

In the remainder of this Section, we will explore some of the consequences of Theorem 2.4 and, in
particular, of Corollary 2.6, with the goal of illustrating some of the key properties of the extended family
$\mathcal{E}$.

We begin by observing that, as shown in Equation (20) in the proof of Theorem 2.4, if $d \notin \mathcal{N}(P)$, the
sequence of distributions parametrized by the points $\theta_n = \theta + \rho_n d$ corresponds to distributions in the
original family $\mathcal{E}_P$ whose mean value parameters $\mu_n \in \mathrm{relint}(P)$ are such that
$\|\mu_n\|_2 \to \infty$, with $\mu_n$ bounded away from rb(P). It is clear that this can occur only if $C \ne \{0\}$,
i.e. if the convex support is unbounded. In fact, when P is a polytope, we have $\mathcal{N}(P) = \mathbb{R}^k$ (see
Equation 8), so that Corollary 2.6 further yields that each density in the family $\mathcal{E}_F$ can be obtained as
$\lim_n p_{\theta_n}$, where $\{\theta_n\}$ is any $(\theta, d, \{\rho_n\})$-sequence with $\theta \in \Theta$ and
$d \in \mathrm{ri}(N_F)$. Formally,

Corollary 2.7. If P is a polytope, then, for any $d \in \mathbb{R}^k$, any $(\theta, d, \{\rho_n\})$-sequence
$\{\theta_n\}$ and any face F,

$$p_{\theta_n} \to p^F_{\theta_F}, \ \text{a.e. } \nu, \quad \text{and} \quad \mu_n \to \mu_F \in \mathrm{relint}(F),$$

if and only if $d \in \mathrm{ri}(N_F)$.

In fact, our analysis of exponential random graph models of Section 3.4 is almost entirely an illustration
of the previous Corollary.

Another implication of Corollary 2.6 is that, when the MLE does not exist, the directions of increase of
the likelihood function for a given observed sample $x \in \mathrm{relint}(F)$ are precisely the points in the associated
normal cone $N_F$. Formally, let X = x be the observed sufficient statistics and let $\ell_x : \Theta \to \mathbb{R}$ be
the log-likelihood function, given by $\ell_x(\theta) = \log p_\theta(x)$. Then, $-\ell_x$ is a proper convex function,
strictly convex if and only if $x \in \mathrm{ri}(P)$. This follows from Lemma 2.2 and the well-known convexity
properties of the cumulant generating function $\psi$ (see, e.g., Brown, 1986, Theorem 1.13). Then, following
Rockafellar (1970, Chapter 8), $d \in \mathbb{R}^k$ is a direction of recession for $-\ell_x$ if

$$\liminf_{\rho \to \infty} \, -\ell_x(\theta + \rho d) < \infty, \qquad (13)$$

for one, and thus for all, $\theta \in \mathrm{dom}(\ell_x) = \Theta$. The set of all directions of recession of $-\ell_x$
is called the recession cone of $-\ell_x$. It is clear that convex functions admitting directions of recession might
not achieve their infimum at any point in their effective domain. On account of the next result, the recession cone
of $-\ell_x$ is a cone of the normal fan of P, almost everywhere $\nu$.

Corollary 2.8. For any observable sufficient statistics $X = x \in P$, the polyhedral cone $N_F$ is the recession
cone of the negative log-likelihood function $-\ell_x$, where F is the unique, possibly improper, face of P such that
$x \in \mathrm{relint}(F)$.

In particular, when $x \in \mathrm{relint}(P)$, i.e. when the MLE exists, the corresponding recession cone is just the
point {0} (since $\dim(P) = k$), so that the negative log-likelihood function does not have any direction of
recession and, therefore, its infimum is achieved at one parameter point $\hat{\theta} \in \mathbb{R}^k$ with finite norm,
namely the MLE. On the other hand, when the MLE is nonexistent, the likelihood function increases for any sequence
of natural parameters with norm diverging to infinity along any direction $d \in N_F$, where $N_F$ is the normal
cone of the face of P containing the observed sufficient statistics in its relative interior.
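
A one-dimensional example may help to visualize directions of recession. The sketch below assumes the binomial
family in natural form (support {0, ..., n}, base measure $\binom{n}{t}$), which we use only as an illustration, and
evaluates the log-likelihood along rays $\theta = \rho d$. When the observation sits at the vertex x = n, the
direction d = +1 belongs to the normal cone of that vertex and the log-likelihood increases towards its supremum
without attaining it. In the opposite direction, or for an interior observation, it diverges to minus infinity.

```python
import math

def log_likelihood(theta, x, n):
    """l_x(theta) = x * theta - n * log(1 + e^theta) for the binomial family."""
    # log(1 + e^theta) evaluated stably for large |theta|
    softplus = max(theta, 0.0) + math.log1p(math.exp(-abs(theta)))
    return x * theta - n * softplus

n = 10
rhos = (0, 5, 10, 20, 40)
boundary = [log_likelihood(rho, n, n) for rho in rhos]   # x = n, direction d = +1
opposite = [log_likelihood(-rho, n, n) for rho in rhos]  # x = n, direction d = -1
interior = [log_likelihood(rho, 3, n) for rho in rhos]   # x = 3, direction d = +1

print(boundary)  # non-decreasing and bounded above by 0
print(opposite)  # decreasing without bound
print(interior)  # decreasing without bound: no direction of recession
```

In the notation of (13), only the first ray keeps $-\ell_x$ bounded, so d = +1 is a direction of recession when
x = n, whereas no nonzero direction qualifies when x is interior.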

We note that Corollary 2.8 could be stated in a more general form. Indeed, for any $\mu \in P$, letting
$\ell_\mu : \Theta \to \mathbb{R}$ be given by

$$\ell_\mu(\theta) = \langle \mu, \theta \rangle - \psi(\theta),$$

it can be verified that the proof of Corollary 2.8 still holds with $\ell_x$ replaced by $\ell_\mu$. Though theoretically
relevant, this fact has little practical value.

During the preparation of the paper, we learned of similar results in Geyer (2008), which are based
on the characterization of the convex support in terms of the tangent cones and normal cones. While his
analysis applies to more general classes of exponential families, our results are more refined, as we take
full advantage of the polyhedral assumption and establish a more direct connection between the extended
families and the cones in the normal fan.

By combining the results derived so far, we next show that, when $x \in \mathrm{relint}(F)$, the extended MLE will
be the affine subspace of dimension $\dim(N_F)$ given by $\theta_F$, where $E_{\theta_F}[X] = x$. Though not entirely a
new result (see Brown, 1986, Chapter 6), our proof and the characterization of $\theta_F$ in terms of $N_F$ are novel.

Corollary 2.9. Let $x \in \mathrm{relint}(F)$ and $\theta_F$ be the congruence class of $\theta$ modulo
$\mathrm{lin}(N_F)$ such that $E_{\theta_F}[X] = x$. Then,

$$\sup_{\theta \in \Theta} p_\theta(x) = p^F_{\theta_F}(x).$$

For completeness, we conclude this section by linking our discussion with alternative characterizations 
of the closure of the family £p existing in the literature, which could be easily obtained using Theorem 2.4 
(see, in particular, Csiszar and Matus, 2001, 2003, 2005, 2008). 

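Corollary 2.10. For any $(\theta, d, \{\rho_n\})$-sequence $\{\theta_n\}$ with $d \in \mathrm{ri}(N_F)$,

i) $P_{\theta_n} \xrightarrow{TV} P^F_{\theta_F}$, where $\xrightarrow{TV}$ denotes convergence in total variation;

ii) $\lim_n K(P^F_{\theta_F}, P_{\theta_n}) = 0$, where K(P, Q) is the Kullback-Leibler divergence of P from Q;

iii) $P_{\theta_n} \Rightarrow P^F_{\theta_F}$, where $\Rightarrow$ denotes convergence in distribution.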
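
The first two modes of convergence can be checked on a toy family. The sketch below assumes a counting base measure
on the lattice points of $[0, 2]^2$, the edge F = {x1 = 2} and the direction d = (1, 0); none of these choices come
from the paper. It computes the total variation distance and the Kullback-Leibler divergence between the
distributions along the sequence and the limiting distribution supported on F.

```python
import math

SUPPORT = [(i, j) for i in range(3) for j in range(3)]
FACE = [pt for pt in SUPPORT if pt[0] == 2]  # the edge F = {x1 = 2}

def distribution(theta, points):
    """Probabilities of the exponential family with counting base measure on points."""
    logw = [theta[0] * x + theta[1] * y for x, y in points]
    m = max(logw)
    w = [math.exp(v - m) for v in logw]
    z = sum(w)
    return {pt: wi / z for pt, wi in zip(points, w)}

def total_variation(p, q):
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def kl_divergence(p, q):
    """K(P, Q) = sum_x p(x) log(p(x) / q(x)); needs q > 0 wherever p > 0."""
    return sum(px * math.log(px / q[k]) for k, px in p.items() if px > 0)

theta, d = (0.3, -0.7), (1.0, 0.0)
limit = distribution(theta, FACE)  # the limiting distribution on F
for rho in (0.0, 2.0, 5.0, 10.0, 20.0):
    p_n = distribution((theta[0] + rho * d[0], theta[1] + rho * d[1]), SUPPORT)
    print(rho, total_variation(p_n, limit), kl_divergence(limit, p_n))
```

In this example both quantities are functions of the mass that the n-th distribution places outside F, which
decays exponentially in rho.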

2.4 Computational considerations 

Based on our findings, we can make a few observations regarding the computational difficulties of finding 
the extended MLE, some of which are exemplified in the next result. 

Corollary 2.11. Let $\{\theta_n\}$ be a $(\theta, d, \{\rho_n\})$-sequence, with $d \in \mathrm{ri}(N_F)$. Then, for every
$\zeta \in \mathrm{lin}(N_F)$,

$$I(\theta_n) \to I_F(\theta + \zeta), \qquad (14)$$

where convergence is pointwise.

From the corollary and equation (9), we can infer that, when the MLE does not exist, maximizing the log-likelihood
function using the Newton-Raphson method, as well as virtually any other fastest ascent method,
may fail due to numerical instabilities. In fact, the Newton-Raphson algorithm proceeds by finding a sequence
$\{\theta_n\}$ of natural parameters along which $\ell_x$ increases most rapidly. At each step of the procedure, the next
point in the sequence is determined by the direction of fastest ascent of $\ell_x$, given by the inverse of the Hessian,
i.e. by the inverse of $I(\theta_n)$. However, for all n large enough, these matrices will be badly conditioned, since,
at the optimum, the Fisher information matrix is not invertible (see equation 9). In addition, especially when
the observed statistics x belong to the relative interior of a face of small dimension, these singularities can
be dramatic, not to mention the fact that, unless x lies in the relative interior of a facet, there is an infinite
number of directions along which the likelihood function increases. It is apparent that these problems are
even more accentuated in high-dimensional settings or whenever the data are sparse. From the statistical
standpoint, equations (14) and (9) further imply that, when the MLE does not exist, the standard error
may be quite large (in the limit, infinite), and that the number of degrees of freedom should be adjusted
to reflect the non-estimability of some parameters. As a result, any hypothesis testing or model selection
procedure that relies solely on these estimates should be regarded as, at the very least, unreliable. Based on these
considerations, it is clear that not only is the task of computing the extended MLE particularly daunting, but
the statistical interpretation of these quantities is also rather delicate.
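
The ill-conditioning described above can be observed directly. The sketch below assumes a toy family with counting
base measure on the lattice points of $[0, 2]^2$ (our own illustrative choice) and computes the Fisher information,
i.e. the covariance matrix of X, along a sequence moving in the direction d = (1, 0) of the normal cone of the edge
{x1 = 2}. The determinant shrinks towards zero, so inverting this matrix in a Newton-Raphson step becomes
numerically unstable.

```python
import math

SUPPORT = [(i, j) for i in range(3) for j in range(3)]

def fisher_information(theta):
    """Covariance matrix of X under p_theta (the Hessian of psi), as a 2x2 tuple."""
    logw = [theta[0] * x + theta[1] * y for x, y in SUPPORT]
    m = max(logw)
    w = [math.exp(v - m) for v in logw]
    z = sum(w)
    p = [wi / z for wi in w]
    mx = sum(pi * x for pi, (x, _) in zip(p, SUPPORT))
    my = sum(pi * y for pi, (_, y) in zip(p, SUPPORT))
    vxx = sum(pi * (x - mx) ** 2 for pi, (x, _) in zip(p, SUPPORT))
    vyy = sum(pi * (y - my) ** 2 for pi, (_, y) in zip(p, SUPPORT))
    vxy = sum(pi * (x - mx) * (y - my) for pi, (x, y) in zip(p, SUPPORT))
    return ((vxx, vxy), (vxy, vyy))

def determinant(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

theta, d = (0.3, -0.7), (1.0, 0.0)
dets = []
for rho in (0.0, 2.0, 5.0, 10.0):
    info = fisher_information((theta[0] + rho * d[0], theta[1] + rho * d[1]))
    dets.append(determinant(info))
    print(rho, info, dets[-1])
```

The variance in the direction of d vanishes while the variance along the edge stays bounded away from zero, so the
limiting matrix has rank one, which equals the dimension of the face, in agreement with (9).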

We refer the reader to Geyer (2008) and Rinaldo (2006b) for different algorithmic approaches to computing
the extended MLE for certain types of exponential models with polyhedral support for which an $\mathcal{H}$ or a
$\mathcal{V}$ representation of P of the form (5) or (7) is either available or easily computable. We remark that, in order
to determine the extended MLE, it is necessary not only to have an explicit representation of P but, in addition,
to have in closed form the log-partition functions $\psi_F$, for each face F. A class of models for
which both conditions are satisfied is the class of log-linear models. If this information is not available,
one may resort to MCMC techniques for computing the MLE or a pseudo-MLE, as for the class of models to
be described in the next section. See Geyer and Thompson (1992), Handcock (2003), Snijders (2002),
Wasserman and Robins (2004), Handcock et al. (2006), Hunter et al. (2008) and references therein.

As a final comment, we point out that, while computing the extended MLE is very often a hard problem,
deciding whether the MLE exists is typically more feasible, and can be accomplished using linear programming,
provided an explicit representation, namely an $\mathcal{H}$ or a $\mathcal{V}$ representation, of P is available. See
Appendix C for details and also Eriksson et al. (2006) for an application to hierarchical log-linear models.
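
When a small $\mathcal{H}$ representation is explicitly available, the existence check reduces to verifying strict
inequalities, and the tight rows identify the face containing the observation. The sketch below is a simplified
stand-in under that assumption and does not reproduce the linear programming formulation of Appendix C. The
matrix A, the vector b and the function names are hypothetical.

```python
def mle_exists(x, A, b, tol=1e-12):
    """True if A x < b holds strictly in every row, i.e. x is interior to P.

    Assumes P = {x : A x <= b} is full-dimensional and the system has no
    implicit equalities, as in the H representation (5).
    """
    for row, bi in zip(A, b):
        if sum(a * xi for a, xi in zip(row, x)) >= bi - tol:
            return False
    return True

def active_rows(x, A, b, tol=1e-12):
    """Indices of the inequalities tight at x; these rows generate the normal cone."""
    return [i for i, (row, bi) in enumerate(zip(A, b))
            if abs(sum(a * xi for a, xi in zip(row, x)) - bi) <= tol]

# The square [0, 2]^2 written as A x <= b.
A = [(1, 0), (-1, 0), (0, 1), (0, -1)]
b = [2, 0, 2, 0]

print(mle_exists((1, 1), A, b), active_rows((1, 1), A, b))  # interior point
print(mle_exists((2, 1), A, b), active_rows((2, 1), A, b))  # on the facet x1 = 2
print(mle_exists((2, 2), A, b), active_rows((2, 2), A, b))  # at a vertex
```

For an observation on the boundary, the rows returned by `active_rows` are the rows of the subsystem in (6), and
their conic combinations are the directions along which the likelihood keeps increasing.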

3 Application to Exponential Random Graph Models 

We now apply some of the results from the previous section to the class of exponential random graph 
models. The motivation for our choice is the attempt to explain certain features of ERG models that have 
been observed empirically and have been collectively labeled as degeneracy (see, e.g., Handcock, 2003). 
Our point of view is simply that there is nothing degenerate or unusual about these models, whose behavior 
can in fact be explained in a direct way using the properties of exponential families with polyhedral support 
as described in Section 2.3. 

Our arguments rely on a thorough analysis of one ERG model, described below in Section 3.2, and on 
graphical renderings of Corollary 2.7, which we find particularly effective and elucidative of our results. We 
looked at a variety of other ERG models on 7, 8 and 9 nodes, using different choices of the network statistics
described below, and arrived at the same kind of conclusions we are about to present.

Finally, we would like to emphasize that, as the log-partition function is not available in closed form, an
exact analysis of ERG models on larger graphs is almost impossible. This is due to the need to enumerate all
possible graphs with a given number of nodes in order to evaluate that function, a task that becomes
computationally prohibitive very rapidly as the number of nodes grows; see Equation (15) below and
Table 1.

3.1 Introduction to ERG models 

There is an extensive literature on ERG models and their use in social network analysis. A partial but representative
list of references is: Holland and Leinhardt (1981), Frank and Strauss (1986), Wasserman and
Pattison (1996), Wasserman and Robins (2004), Robins et al. (2007a,b) and references therein. Below we 
briefly describe the settings for ERG models, in order to make explicit the connections with the material in 
the previous sections. 

Consider the set $\mathcal{G}_g$ of all possible simple, i.e. unweighted, undirected and without loops, graphs on g
nodes. Every such graph x can be described by a 0-1 symmetric $g \times g$ adjacency matrix, whose (i, j)-th
entry is 1 if there exists an edge between the nodes i and j and 0 otherwise. Thus, x can be represented
as a $\binom{g}{2}$-dimensional 0-1 vector. The cardinality of $\mathcal{G}_g$ grows super-exponentially in the number of
nodes g, namely

$$|\mathcal{G}_g| = 2^{\binom{g}{2}}, \qquad (15)$$

so that network modeling entails constructing probability distributions over very large discrete spaces (see
Table 1).

Number of nodes: g    Number of edges: $\binom{g}{2}$    Number of graphs: $|\mathcal{G}_g|$
 7                    21                                 2,097,152
 8                    28                                 268,435,456
 9                    36                                 68,719,476,736
10                    45                                 35,184,372,088,832

Table 1: Some information about the complexity of some ERG models on small graphs.
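
The entries of Table 1 follow directly from (15). As a quick check, the short sketch below recomputes them; the
function name is ours.

```python
import math

def number_of_graphs(g):
    """Return (number of possible edges, number of simple graphs) on g nodes."""
    edges = math.comb(g, 2)
    return edges, 2 ** edges

for g in range(7, 11):
    edges, graphs = number_of_graphs(g)
    print(f"g = {g:2d}   edges = {edges:2d}   graphs = {graphs:,}")
```

Each additional node multiplies the number of graphs by $2^{g}$, which is why exhaustive enumeration quickly
becomes infeasible.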

Let T : C/g i— > K*^ be a vector valued function of network statistics quantifying the key features of interest of 
a given observed graph. In this article we are mostly concerned with ERG models arising from network statis- 
tics that capture rather general and aggregate features of the network. Typical examples of such statistics 
are (see, e.g., Goodreau, 2007, for more details): 



1. the number of edges: $\sum_{i<j} x_{ij}$;

2. the number of triangles: $\sum_{i<j<h} x_{ij} x_{jh} x_{ih}$;

3. the $k$-degree statistic: $D_k(x) = \sum_{i=1}^{g} 1\{d_i = k\}$, where $d_i = \sum_{j} x_{ij}$ is the degree of the $i$-th node and $0 \le k \le g-1$;

4. the number of $k$-stars: $\sum_{i=k}^{g-1} \binom{i}{k} D_i(x)$, $2 \le k \le g-1$, i.e. the number of sets of $k$ distinct edges that are incident to the same node, where $D_i(x)$ is the $i$-th degree statistic given above;

5. the alternating $k$-star statistic, a weighted alternating sum of the $k$-star counts that depends on a positive parameter $\lambda$.
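As an illustration of the first four statistics in the list, the sketch below computes them from a 0-1 symmetric adjacency matrix stored as a list of lists. The function names and the small example graph are assumptions made for this illustration only.

```python
from itertools import combinations
from math import comb


def degrees(adj):
    """Degree d_i of every node (row sums of the adjacency matrix)."""
    return [sum(row) for row in adj]


def n_edges(adj):
    """Number of edges: sum of x_ij over i < j."""
    g = len(adj)
    return sum(adj[i][j] for i, j in combinations(range(g), 2))


def n_triangles(adj):
    """Number of triangles: sum of x_ij * x_jh * x_ih over i < j < h."""
    g = len(adj)
    return sum(adj[i][j] * adj[j][h] * adj[i][h]
               for i, j, h in combinations(range(g), 3))


def degree_stat(adj, k):
    """k-degree statistic D_k(x): number of nodes with degree exactly k."""
    return sum(1 for d in degrees(adj) if d == k)


def n_kstars(adj, k):
    """Number of k-stars, computed from the degree statistics D_i(x), i >= k."""
    g = len(adj)
    return sum(comb(i, k) * degree_stat(adj, i) for i in range(k, g))


# Hypothetical example: a 4-cycle 0-1-2-3-0 with the chord {0, 2}.
adj = [[0, 1, 1, 1],
       [1, 0, 1, 0],
       [1, 1, 0, 1],
       [1, 0, 1, 0]]
print(n_edges(adj), n_triangles(adj), degree_stat(adj, 2), n_kstars(adj, 2))
```

Computing the $k$-star count through the degree statistics, rather than node by node, mirrors the formula in item 4 and makes explicit that $k$-stars carry no information beyond the degree distribution.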

For all modeling purposes, these network statistics are effectively regarded as sufficient statistics and, by the Koopman-Pitman-Darmois theorem, the resulting exponential family of distributions provides a convenient statistical model for $\mathcal{G}_g$. Formally, given a set of network statistics in the form of an $\mathbb{R}^k$-valued function $T(\cdot)$ on $\mathcal{G}_g$, the ERG model $\mathcal{P} = \{Q_\theta, \theta \in \Theta \subseteq \mathbb{R}^k\}$ is the exponential family of probability distributions over $\mathcal{G}_g$ with natural sufficient statistics $T(x)$ and base measure $\mu$ given by the counting measure on $\mathcal{G}_g$. Thus, for $\theta \in \Theta$, the density of $Q_\theta$ with respect to $\mu$ is

$$\frac{dQ_\theta}{d\mu}(x) = q_\theta(x) = \exp\{\langle T(x), \theta \rangle - \psi(\theta)\} = \mathrm{Prob}\{X = x\}.$$

Let $\mathcal{T} = \{t \in \mathbb{R}^k : t = T(x), x \in \mathcal{G}_g\}$ be the range of $T(\cdot)$ and $\nu$ the measure on $\mathcal{T}$ induced by $\mu$, namely

$$\nu(t) = \mu\{x \in \mathcal{G}_g : T(x) = t\} = |\{x \in \mathcal{G}_g : T(x) = t\}|, \quad t \in \mathcal{T}.$$

Then, the distribution of $T(X)$ belongs to the exponential family of distributions on $\mathcal{T}$ with base measure $\nu$, natural parameter space $\Theta$ and densities

$$p_\theta(t) = \exp\{\langle t, \theta \rangle - \psi(\theta)\}, \quad \theta \in \Theta.$$

Furthermore, because of the discreteness of the problem,

$$\mathrm{Prob}\{T(X) = t\} = \int_{\{x \in \mathcal{G}_g : T(x) = t\}} q_\theta(x)\, d\mu(x) = \sum_{\{x \in \mathcal{G}_g : T(x) = t\}} q_\theta(x) = p_\theta(t)\,\nu(t).$$
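For a very small number of nodes the induced measure $\nu$ and the densities $p_\theta$ can be obtained by brute-force enumeration. The sketch below does this for the edge-triangle statistic on $g = 4$ nodes; the choice of statistic, the value of $\theta$ and all function names are assumptions made for illustration, and the approach does not scale beyond a handful of nodes.

```python
from collections import Counter
from itertools import combinations, product
from math import exp, log


def base_measure(g):
    """nu(t): number of graphs on g nodes with T(x) = t = (edges, triangles)."""
    dyads = list(combinations(range(g), 2))
    triples = list(combinations(range(g), 3))
    nu = Counter()
    for bits in product((0, 1), repeat=len(dyads)):
        x = dict(zip(dyads, bits))  # x[i, j] = 1 if the edge {i, j} is present
        edges = sum(bits)
        triangles = sum(x[i, j] * x[j, h] * x[i, h] for i, j, h in triples)
        nu[edges, triangles] += 1
    return nu


def log_partition(theta, nu):
    """psi(theta) = log of the sum over t of nu(t) * exp(<t, theta>)."""
    return log(sum(n * exp(theta[0] * t[0] + theta[1] * t[1])
                   for t, n in nu.items()))


def density(theta, nu):
    """p_theta(t), the density of T(X) with respect to nu."""
    psi = log_partition(theta, nu)
    return {t: exp(theta[0] * t[0] + theta[1] * t[1] - psi) for t in nu}


nu = base_measure(4)                     # enumerates all 2^6 graphs
theta = (-0.5, 0.3)                      # arbitrary illustrative parameter
p = density(theta, nu)
prob_T = {t: p[t] * nu[t] for t in nu}   # Prob{T(X) = t} = p_theta(t) * nu(t)
print(sorted(nu.items()))
print(sum(prob_T.values()))
```

Working with $\nu$ on the range of $T$ rather than with the counting measure on the graphs reduces every later sum from $2^{\binom{g}{2}}$ terms to one term per distinct value of the statistic.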

Provided that the network statistics are affinely independent, as is the case for the examples given above and as can always be assumed through reduction to minimality, the convex support $P = \mathrm{convhull}(\mathcal{T})$ is a $k$-dimensional polytope. Recalling that $\nu$ is finite, it is easy to see that the assumptions (A1)-(A4) of Section 2.1 are verified, and the theory developed above applies.

Despite its simplicity and interpretability, we need to emphasize that ERG modeling based on simple, low-dimensional network statistics such as the ones described above can be rather coarse. In fact, those ERG models are invariant with respect to the relabeling of the nodes and even to changes in the graph topologies, depending on the network statistics themselves. As a result, they do not specify distributions over graphs per se, but rather distributions over large classes of graphs having the same network statistics. Consequently, as we repeatedly observed in our experiments and as elucidated in the example we are about to present, it may very well be the case that many graphs having very different topologies still belong to the same class and, therefore, are considered as equivalent. While this feature may be well suited for defining distributions over large thermodynamic ensembles in statistical physics, its use in other contexts in which the nodes are not interchangeable may be questionable. This is certainly not a common feature of all ERG models: for example, the $p_1$ model by Holland and Leinhardt (1981) and the Markov graphs by Frank and Strauss (1986) are based on much finer network statistics whose dimension, unlike the aggregate statistics reported above, increases with the size of the network. These more complex models represent explicitly distributions of individual networks rather than of classes of networks: both $p_1$ and Markov graph models are log-linear models over the probability of edges (see Fienberg and Wasserman, 1981). However, they also present difficulties. In fact, not only is the MLE not likely to exist if the observed network is even moderately sparse, but the asymptotics of these models as $g$ grows remain unknown (see, e.g. Haberman, 1981, for some comments on $p_1$ models). While the theory developed in the previous sections applies to all ERG models, our analysis below is more directly relevant to models arising from simpler network statistics quantifying macroscopic properties of the network.

3.2 Our Running Example 

Throughout, we will use the example of an ERG model on $\mathcal{G}_9$ with two-dimensional network statistic $T(x) = (T_1(x), T_2(x)) \in \mathbb{N}^2$, where $T_1(x)$ is the number of edges and $T_2(x)$ is the number of triangles. Note that this model is not hierarchical in the sense of Bishop et al. (1975) and Lee and Nelder (1996), since we do not include the network statistic for the number of 2-stars, which lies intermediate between edges and triangles. The lack of hierarchical model structure affects the interpretation of the exponential family parameters corresponding to $T(x)$ but turns out not to be the cause of the degeneracies we illustrate. We have actually produced similar results for models which are fully hierarchical, but the results are easier to demonstrate in the context of this ERG model with a two-dimensional network statistic.

The number of distinct graphs for this $\mathcal{G}_9$ example is $2^{36}$, while the number of distinct two-dimensional network statistics is only 444. The natural parameter space is the entire $\mathbb{R}^2$. The support of the distribution of $T(X)$ is shown in Figure 1. The convex support for the induced family of distributions of network statistics is a polygon with 6 edges, whose boundary is depicted with the red solid line. Out of the possible 444 points, 29 actually lie on the boundary. The induced base measure $\nu$ for this family, i.e. the frequencies of each possible pair of network statistics, is indicated by the color shading of the circles. The maximal value of $\nu(t)$ is 1,876,664,161, the median value is 2,741,130, while the first and third quartiles are 545,265 and 79,674,084, respectively. Figure 2 shows a plot of the empirical quantile function for $\nu(t)$, $t \in \mathcal{T}$, which indicates that a few network configurations are much more frequent than others.
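The same quantities (support, convex support, boundary points) can be explored by brute force on a smaller instance. The sketch below uses $g = 5$ nodes instead of the 9 nodes of the running example, since enumerating $2^{36}$ graphs this way is impractical; the hull routine is a standard monotone-chain construction and all names are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations, product


def base_measure(g):
    """nu(t): number of graphs on g nodes with (edges, triangles) = t."""
    dyads = list(combinations(range(g), 2))
    triples = list(combinations(range(g), 3))
    nu = Counter()
    for bits in product((0, 1), repeat=len(dyads)):
        x = dict(zip(dyads, bits))
        nu[sum(bits), sum(x[i, j] * x[j, h] * x[i, h] for i, j, h in triples)] += 1
    return nu


def cross(o, a, b):
    """Signed area of the parallelogram spanned by (a - o) and (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Monotone-chain convex hull; vertices are returned counter-clockwise."""
    pts = sorted(set(points))
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def boundary_points(points, hull):
    """Support points lying on some edge of the hull."""
    n = len(hull)
    return [p for p in points
            if any(cross(hull[i], hull[(i + 1) % n], p) == 0 for i in range(n))]


nu = base_measure(5)                  # 2^10 = 1024 graphs
support = sorted(nu)                  # distinct values of (edges, triangles)
hull = convex_hull(support)           # vertices of the convex support
on_boundary = boundary_points(support, hull)
print(len(support), hull, len(on_boundary))
```

Support points not returned by `boundary_points` lie in the interior of the convex support; for the 9-node example in the text these are the points whose count the paper reports as 444 minus 29.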
Figure 1: Support of the distribution of the network statistics for the ERG model on $\mathcal{G}_9$ described in Section 3.2 (plot title: "Support of the edge and triangle statistics"; horizontal axis: number of edges). The color shading indicates the square root of the relative frequency of each point, namely $\nu(t)$ (darker colors correspond to higher-frequency values of $t$). The solid red line is the boundary of the convex support.

3.3 Degeneracy 

The notion of degeneracy is central to ERG modeling, and has been investigated in various forms in the 
more recent literature. See Snijders (2002), Robins et al. (2007b), Robins et al. (2007a) and, in particular, 
Handcock (2003) and Hunter et al. (2008), just to mention a few. Degeneracy refers quite broadly to a 
variety of features, typically undesirable and surprising, of ERG models that have been observed empirically. 
In the literature, degeneracy (or near degeneracy) is used to describe any of the following, often interrelated, 
phenomena: 

1. when a combination of ERG parameters $\theta$ implies that only a very small number of distinct graphs have substantial non-zero probabilities; in the most extreme cases, these configurations are the empty graph or the fully connected graph;

2. when, for a certain combination of ERG parameters $\theta$, the density function $p_\theta$ has multiple, clearly distinct, modes, and there are only very few network configurations that have non-zero probabilities, often radically different from each other;

3. when the MLE of $\theta$ is nonexistent or hard to obtain, or the MCMCMLE of $\theta$ fails to converge or appears to converge extremely slowly;

4. when the estimate of $\theta$ would make the observed network configuration very unlikely.

Each of the situations just described offers strong evidence of misspecification or, at the very least, of the inability of the model to describe the observed network in a realistic fashion. To our knowledge, Handcock (2003) is the only attempt to characterize degeneracy in a theoretical way, at least the kind of degeneracy yielding unstable maximum likelihood estimates, with emphasis on MCMCMLE methods.

3.4 Degeneracy via Entropy Functions 

We base our analysis on a basic observation: a common feature of all the various instances of degenerate ERG models is that the corresponding distributions are highly concentrated on network configurations associated to a small number of network statistics. Therefore, in order to capture the overall degree of concentration of the family $\mathcal{P}$, we turn to Shannon's entropy function, the rationale being that degenerate models have lower entropy.

Figure 2: Empirical quantiles of the values $\{\nu(t) : t \in \mathcal{T}\}$ for the base measure of the family described in Section 3.2.

Shannon's entropy function $S \colon \Theta \to \mathbb{R}$ is defined as

$$S(\theta) = -\sum_{x \in \mathcal{G}_g} q_\theta(x) \log q_\theta(x) = -\sum_{t \in \mathcal{T}} p_\theta(t) \log p_\theta(t)\, \nu(t),$$

where the second summation involves a much smaller number of terms. Notice that, for every $\theta \in \Theta$,

$$0 \le S(\theta) \le \binom{g}{2} \log 2,$$

the lower and upper bounds corresponding to a degenerate distribution with point mass at one graph, and to the uniform distribution over $\mathcal{G}_g$ (which is within the family if $\nu(t)$ is constant across $\mathcal{T}$ and $\theta = 0$), respectively. Furthermore, as $\psi$ is an analytic function of $\theta$, for every $\theta \in \Theta$, $S(\theta)$ is a smooth function of $\theta$.
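The equality of the two sums defining $S(\theta)$, and the bounds on it, can be verified numerically on a small instance. The sketch below does so for the edge-triangle statistic on $g = 4$ nodes; the statistic, the parameter value and the function names are assumptions for illustration, and a log-sum-exp shift is used only for numerical stability.

```python
from collections import Counter
from itertools import combinations, product
from math import comb, exp, log


def graph_stats(g):
    """Yield T(x) = (edges, triangles) for every simple graph x on g nodes."""
    dyads = list(combinations(range(g), 2))
    triples = list(combinations(range(g), 3))
    for bits in product((0, 1), repeat=len(dyads)):
        x = dict(zip(dyads, bits))
        yield sum(bits), sum(x[i, j] * x[j, h] * x[i, h] for i, j, h in triples)


def entropy(theta, nu):
    """S(theta) as a sum over the range of T, weighted by the base measure nu."""
    a = {t: theta[0] * t[0] + theta[1] * t[1] for t in nu}
    m = max(a.values())                       # shift for numerical stability
    log_z = m + log(sum(n * exp(a[t] - m) for t, n in nu.items()))
    s = 0.0
    for t, n in nu.items():
        log_p = a[t] - log_z                  # log p_theta(t)
        s -= n * exp(log_p) * log_p
    return s


def entropy_over_graphs(theta, g):
    """S(theta) as a sum over all graphs (first form of the definition)."""
    a = [theta[0] * t[0] + theta[1] * t[1] for t in graph_stats(g)]
    m = max(a)
    log_z = m + log(sum(exp(v - m) for v in a))
    return -sum(exp(v - log_z) * (v - log_z) for v in a)


g = 4
nu = Counter(graph_stats(g))
theta = (0.7, -1.1)                           # arbitrary illustrative parameter
print(entropy(theta, nu), entropy_over_graphs(theta, g))
print(entropy((0.0, 0.0), nu), comb(g, 2) * log(2))
```

At $\theta = 0$ every graph is equally likely, so the entropy attains the upper bound $\binom{g}{2} \log 2$.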

Noting that $\lim_{x \downarrow 0} x \log x = 0$ and using the fact that $S(\theta)$ is bounded, by the dominated convergence theorem, Corollary 2.6 yields that, for every $(\theta, \{\rho_n\}, d)$-sequence with $d \in \mathrm{ri}(N_F)$,

$$\lim_n S(\theta_n) = S_F(\theta_F) = -\int_{\mathcal{T}} p_{\theta_F}(t) \log p_{\theta_F}(t)\, d\nu_F(t), \tag{16}$$

for every face $F$ of $P$.

On the other hand, because of the correspondence between natural and mean value parameters, the entropy function can be equivalently represented as a function over $P$. More precisely, we define $V \colon P \to \mathbb{R}$ as follows: if $\mu \in \mathrm{relint}(P)$,

$$V(\mu) = S(\theta),$$

where $\mu = \nabla\psi(\theta)$, while, for $\mu_F \in \mathrm{relint}(F)$,

$$V_F(\mu_F) = S_F(\theta_F),$$

where $\mu_F = \nabla\psi_F(\theta_F)$. Thus, if $\{\theta_n\}$ is a $(\theta, \{\rho_n\}, d)$-sequence with $d \in \mathrm{ri}(N_F)$ and if $\mu_n = E_{\theta_n}[T(X)]$, from Equation (16) we obtain that

$$\lim_n V(\mu_n) = V_F(\mu_F),$$

where $\mu_F = \lim_n \mu_n$, with $V(\mu)$ a smooth function of $\mu$. Thus, we conclude that $S(\cdot)$ and $V(\cdot)$ have homeomorphic graphs and, therefore, they convey the same information.

Figure 3: Plots of the entropy function $V(\cdot)$ under mean value parametrization for the ERG model of Section 3.2 (panel titles: "Entropy plot - Mean Value Space"; axes: mean number of edges, mean number of triangles). Part a): 2-dimensional plot over the convex support $P$; the points correspond to the support of the family. Part b): surface plot.

Below, we use both entropy functions to illustrate the theory developed in Section 2.3 and to provide 
some characterizations of degeneracy. 

We start with Figures 3 and 4. The latter displays the entropy function $S(\theta)$ for the ERG model on $\mathcal{G}_9$ with network statistic taking values in $\mathbb{N}^2$, as described in Section 3.2, and for values of $\theta$ in the rectangle $[-10, 25] \times [-25, 10]$. The equivalent entropy function over the mean value space, $V(\mu)$, is displayed in Figure 3, for the mean value parameters $\{\mu : \mu = \nabla\psi(\theta), \theta \in [-10, 25] \times [-25, 10]\}$. Figures 4 and 3 offer two equivalent views of the exponential family $\mathcal{P}$ via the entropy functions $S(\theta)$ and $V(\mu)$. The mean value view in Figure 3 is straightforward to interpret: the entropy function is a well behaved, strictly concave function that changes smoothly as the mean parameter varies inside the relative interior of $P$. Distributions with mean value parameters lying well inside the cloud of points describing the support of the family have higher entropy, as their mass is distributed across a larger number of network configurations. In contrast, distributions with mean value parameters that are far removed from that cloud, including points very close to or on the boundary of $P$, have lower entropy. It is worth pointing out that, for this specific family, the points in the support are closer to the lower boundary of the polygon $P$, while the side of $P$ determined by the convex hull of the points corresponding to the empty and complete graphs is significantly distant from the support. This phenomenon becomes more pronounced as $g$ grows, so that this family will include many distributions, whose mean value parameters belong to a region far removed from the support, that would not provide a satisfactory or realistic explanation of any observed network, a feature that is often associated with degeneracy.

Figure 4: Plots of the entropy function $S(\cdot)$ under natural parametrization for the ERG model of Section 3.2 (panel titles: "Entropy plot: Natural Parameter Space"; horizontal axis: edge parameter). Part a): 2-dimensional plot over a square of the natural parameter space. Part b): surface plot.

Figure 5: All possible MLEs of the natural parameters for the ERG model of Section 3.2 superimposed over the entropy plot (plot title: "Entropy plot: Natural Parameter Space and All Possible MLEs"; horizontal axis: edge parameter).

In striking contrast, the natural parameter view of Figure 4 does not lend itself to immediate interpretations. In fact, although $S(\theta)$ and $V(\mu)$ are smooth functions related via the homeomorphism (2), $S(\theta)$ displays drastic localized behaviors, including multiple local maxima. In particular, the function $S(\theta)$ exhibits sharp changes and high-peaked ridges shooting to infinity along which it remains roughly constant. Furthermore, small variations in the natural parameter values cause big changes in the values of the entropy function, thus making this ERG model rather unstable, in the sense that neighboring parameters specify very different distributions, or at least distributions with different entropies. These features may in fact fall under the general umbrella of degeneracy, as described in Section 3.3. Finally, we remark that the portion of the natural parameter space containing parameter points that produce more realistic distributions with higher entropy values is relatively small, a characteristic that emerged from the inspection of Figure 3 as well. In addition, the entropy function remains relatively high along some rays leaving the origin and shooting to infinity. We remark that Figure 4 matches quite closely analogous plots, not based on Shannon's entropy, for the same ERG model on graphs with 7 nodes by Handcock (2003), although the interpretation of the plots using normal cones, as described below, is missing there.

Figure 5 shows all the possible MLEs corresponding to the 415 points in the support of $\mathcal{E}_P$ that are inside $P$. These points are all the estimates that can be obtained by the maximum likelihood procedure, so that, although the family $\mathcal{E}_P$ contains many other distributions, inference is restricted to the 415 distributions identified by the MLEs, whose entropies are displayed in the figure.

Part of the seemingly strange behavior of $S(\theta)$ can however be explained using the results derived in the previous section. To that end, the convex support of Figure 1 can be expressed either as the convex hull of its vertices, namely

$$P = \mathrm{convhull}\{(0, 0), (20, 0), (27, 27), (30, 44), (32, 56), (36, 84)\},$$

or, equivalently, using the $\mathcal{H}$-representation, as the solution set of a system of linear inequalities, i.e.

$$P = \{t \in \mathbb{R}^2 : At \le b\},$$

where $A$ is a $6 \times 2$ matrix and $b \in \mathbb{R}^6$.
The rows of $A$ identify the outer normals to the 6 sides of the polygon $P$ and generate the normal cones to the edges of $P$. The normal cone of a vertex of $P$ is the conic hull of the outer normals to the edges incident to that vertex. For example, the normal cone of the vertex $(0,0)$ is

$$\mathrm{cone}\{(0,-1), (-21, 9)\}.$$

The convex support $P$ and its outer normals are shown in Figure 6. It is immediate to picture that the normal fan of $P$, i.e. the collection of all the cones with apex at the origin identified by the outer normals of $P$, partitions $\mathbb{R}^2$.

Figure 6: Convex support and its outer normals for the ERG model of Section 3.2.
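The passage from the vertex representation to outer normals and vertex normal cones can be sketched in a few lines. The code below starts from the six vertices listed in the text and derives one outer normal per edge; the normals are reduced to primitive integer vectors, so they agree with the rows of $A$ only up to positive scaling (for instance, $(-7, 3)$ is returned where the text has $(-21, 9)$). The function names are assumptions for illustration.

```python
from math import gcd

# Vertices of the convex support P from the text, in counter-clockwise order.
VERTICES = [(0, 0), (20, 0), (27, 27), (30, 44), (32, 56), (36, 84)]


def outer_normals(vertices):
    """Primitive integer outer normal of each edge (v_i, v_{i+1}) of a CCW polygon."""
    normals = []
    n = len(vertices)
    for i in range(n):
        (x0, y0), (x1, y1) = vertices[i], vertices[(i + 1) % n]
        nx, ny = y1 - y0, -(x1 - x0)      # edge direction rotated clockwise by 90 degrees
        s = gcd(abs(nx), abs(ny))
        normals.append((nx // s, ny // s))
    return normals


def h_representation(vertices):
    """Rows a_i and offsets b_i such that P = {t : <a_i, t> <= b_i for all i}."""
    normals = outer_normals(vertices)
    offsets = [a[0] * v[0] + a[1] * v[1] for a, v in zip(normals, vertices)]
    return normals, offsets


def vertex_normal_cone(vertices, i):
    """Generators of the normal cone at vertex i: normals of its two incident edges."""
    normals = outer_normals(vertices)
    return normals[i - 1], normals[i]


A, b = h_representation(VERTICES)
print(A, b)
print(vertex_normal_cone(VERTICES, 0))    # generators of the normal cone at (0, 0)
```

Because only the directions of the normals matter for the normal fan, any positive rescaling of the rows returned here describes the same cones.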
Figure 7: Entropy plot of $S(\cdot)$ with, superimposed, the normal fan of $P$ for the ERG model of Section 3.2 (plot title: "Entropy plot: Natural Parameter Space"; horizontal axis: edge parameter).

Figure 7 shows the entropy plot over the subset $[-10, 25] \times [-25, 10]$ of the natural parameter space with, superimposed, the normal fan of $P$, centered at the origin, which is the point of maximal entropy. As prescribed by Corollary 2.6, the outer normals to $P$ are precisely the directions along which the closure of the original family $\mathcal{E}_P$ is realized, by adding the families $\mathcal{E}_F$, as $F$ ranges over the proper faces (in this case, edges and vertices) of $P$. These directions, starting at the origin, match perfectly the ridges of Figure 4, along which the entropy function seems to converge to some fixed value. This is because any sequence $\{\theta_n\}$ along the outer normal of some edge $F$ will eventually no longer identify distributions from the original family $\mathcal{E}_P$, but just one distribution in $\mathcal{E}_F$ supported on $F$. Consequently, the entropy function does not change because, for all $n$ large enough, $\theta_n$ specifies almost the same distribution.
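The stabilization of the entropy along outer normals can be reproduced on a toy instance. The sketch below uses the edge-triangle model on $g = 4$ nodes (an assumption made to keep the enumeration tiny) and follows two rays from the origin: one in the interior of the normal cone of the vertex $(0,0)$, and one along the outer normal $(0,-1)$ of the edge $\{t : t_2 = 0\}$. Along the second ray the limiting distribution is uniform over the triangle-free graphs, so the entropy should approach the logarithm of their number.

```python
from collections import Counter
from itertools import combinations, product
from math import exp, log


def base_measure(g):
    """nu(t): number of graphs on g nodes with (edges, triangles) = t."""
    dyads = list(combinations(range(g), 2))
    triples = list(combinations(range(g), 3))
    nu = Counter()
    for bits in product((0, 1), repeat=len(dyads)):
        x = dict(zip(dyads, bits))
        nu[sum(bits), sum(x[i, j] * x[j, h] * x[i, h] for i, j, h in triples)] += 1
    return nu


def entropy(theta, nu):
    """Shannon entropy S(theta), computed stably on the range of T."""
    a = {t: theta[0] * t[0] + theta[1] * t[1] for t in nu}
    m = max(a.values())
    log_z = m + log(sum(n * exp(a[t] - m) for t, n in nu.items()))
    return -sum(n * exp(a[t] - log_z) * (a[t] - log_z) for t, n in nu.items())


def entropy_along_ray(rho, d, nu):
    """S(rho * d): entropy at distance rho from the origin in direction d."""
    return entropy((rho * d[0], rho * d[1]), nu)


nu = base_measure(4)
d_vertex = (-1.0, -1.0)   # interior of the normal cone of the vertex (0, 0)
d_edge = (0.0, -1.0)      # outer normal of the edge {t : t_2 = 0}
n_triangle_free = sum(n for t, n in nu.items() if t[1] == 0)

for rho in (1, 5, 10, 20, 40):
    print(rho, entropy_along_ray(rho, d_vertex, nu), entropy_along_ray(rho, d_edge, nu))
print(log(n_triangle_free))   # limiting entropy along the edge normal
```

The first column of entropies decays to zero (point mass at the empty graph), while the second levels off at a strictly positive value, which is the ridge-like behavior described in the text.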

Figures 8, 9 and 10 offer further pictorial representations of Corollary 2.6. These plots were obtained using the MATLAB GUI available at http://www.stat.cmu.edu/~ariiialdo/ERG/ (see Section 9 below). The left side of each plot shows the entropy function for the family of Section 3.2 along with the outer normals of $P$ leaving the origin. The white circles represent the selected natural parameter. The plots on the right show the support of the family. The red stars indicate the mean parameter values corresponding to the natural parameters indicated by the white circles on the left side of the figure. Points with darker shaded colors correspond to network statistics receiving high probability under the selected natural parameter.

Part (a) of Figure 8 shows a distribution with high entropy, corresponding to a mean value parameter well inside the relative interior of $P$. In contrast, in parts (b), (c) and (d) the natural parameter is selected as $d$, with $d$ a point in the relative interior of the 2-dimensional normal cone of the vertex of coordinates $(0, 0)$, which identifies the empty graph. Consequently, the entropy is almost 0, as the associated distribution will put almost all its mass on that vertex of $P$. Notice that, even though the selected natural parameters from parts (b), (c) and (d) are very different from each other, because they are far away from the set of parameters producing nondegenerate distributions and because they all lie inside the normal cone of the vertex $(0, 0)$, they parametrize essentially the same degenerate distribution on the empty graph.

Part (a) of Figure 9 shows the same phenomenon, but for the different degenerate distribution putting virtually all its mass on the complete graph, which corresponds to the vertex $(36,84)$. As with Figure 8, notice that the natural parameter is a point inside the normal cone of that vertex and essentially any point in the upper triangular blue part of the entropy plot (which is, effectively, the relative interior of the associated normal cone) would parametrize this distribution. Parts (b) and (c) show other degenerate distributions over the vertices of $P$ identified by points inside the interiors of the corresponding normal cones. Figure 10 instead displays similar plots for a selection of natural parameters corresponding to directions lying on the normal cones, i.e. the outer normals, of some of the edges of $P$.
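Which face of $P$ a far-away natural parameter points to can be read off by maximizing the linear functional $\langle d, \cdot \rangle$ over the vertices of $P$: a single maximizer means that $d$ is interior to the normal cone of that vertex, while two adjacent maximizers mean that $d$ is an outer normal of the edge joining them. The sketch below applies this to the vertices listed in the text; the function name `limiting_face` and the sample directions are assumptions for illustration.

```python
# Vertices of the convex support P from the text, in counter-clockwise order.
VERTICES = [(0, 0), (20, 0), (27, 27), (30, 44), (32, 56), (36, 84)]


def limiting_face(d, vertices=VERTICES, tol=1e-12):
    """Vertices of P maximizing <d, v>.

    One vertex: d is interior to that vertex's normal cone.
    Two vertices: d is an outer normal of the edge joining them.
    """
    scores = [d[0] * v[0] + d[1] * v[1] for v in vertices]
    best = max(scores)
    return [v for v, s in zip(vertices, scores) if abs(s - best) <= tol]


print(limiting_face((-1, -1)))   # direction toward the empty graph
print(limiting_face((1, 1)))     # direction toward the complete graph
print(limiting_face((0, -1)))    # outer normal of the lower edge
print(limiting_face((-21, 9)))   # outer normal of the edge joining (0, 0) and (36, 84)
```

All directions returning the same single vertex parametrize, far from the origin, essentially the same degenerate distribution, which is the behavior illustrated in Figures 8 and 9.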




Figure 8: Various distributions parametrized by points in the natural parameter space for the ERG model of Section 3.2. The plots on the left are the entropy plots; the white points indicate the selected distributions. The plots on the right all display the convex support. The red crosses represent the mean value parameters corresponding to the selected natural parameters, while the darker shading indicates network statistics configurations that are very probable under the selected parameters. Part (a): distribution with high entropy and mean value parameter inside $P$. Parts (b), (c) and (d): natural parameters all specifying distributions with virtually all of the total mass on the empty graph.
4 Discussion



The purpose of this article has been two-fold. First, for the class of discrete linear exponential families with polyhedral convex support, we have characterized the extended family using the normal fan of the convex support. While complete results about closures of general exponential families exist in the literature, our restriction to families with polyhedral support allowed us to obtain a more refined and explicitly geometrical description. In particular, our findings allowed us to gain a better understanding of the geometric and statistical properties of these families, as well as of the theoretical and algorithmic aspects of computing extended maximum likelihood estimates.

Our second goal was to study the behavior and statistical properties of ERG models, which have seen widespread use for the statistical analysis of social network data. To that end, we applied the theoretical results derived in the first part of the article to one ERG model on the set of graphs with 9 nodes. Despite our analysis being mostly graphical (due to the lack of a closed-form expression for the log-partition function), it captures a few interesting features of this model, some of which account for the seemingly strange behaviors that ERG models have been known to exhibit in practice, generically termed degeneracy. Our investigation indicated that this type of behavior is, in fact, not unusual, and can be fully explained by the properties of linear discrete exponential families. Furthermore, based on similar experimentation with other ERG models, we believe our conclusions are not specific to the model we present here but apply more widely to general ERG models.

The applications presented here are particularly relevant to ERG models built around network statistics that describe macroscopic features of the networks and whose dimension does not grow with the number of nodes. However, our results apply to more complex models, such as the original $p_1$ model of Holland and Leinhardt (1981), which has node-specific parameters and whose likelihood is based on an assumption of dyadic independence. For these models with many parameters, degeneracy is typically due to nonexistence of the MLE, which is very likely to occur if the network is even mildly sparse.

Of course, much more needs to be done in order to fully understand the statistical subtleties, features and potential limitations of ERG models and in order to establish whether they are appropriate to model anything other than a large ensemble. Nonetheless, our contributions indicate that perhaps practitioners attribute to ERG models a degree of regularity that they may not possess.
6 Appendix A: Proofs

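Proof of Lemma 2.2. Let $\theta_1 \in \Theta_F$ and $\zeta \in \mathrm{lin}(N_F)$ and consider the point $\theta_2 = \theta_1 + \zeta$. We first show that $\theta_2 \in \Theta_F$ and $P_{\theta_1} = P_{\theta_2}$. Because $\zeta \in \mathrm{lin}(N_F)$, there exist some scalars $c_1, \ldots, c_{m_F}$ such that

$$\zeta = \sum_{i=1}^{m_F} c_i a_i,$$

where $a_i$ denotes the transpose of the $i$-th row of $A_F$. Therefore, almost everywhere $\nu_F$,

$$\langle \zeta, x \rangle = \sum_{i=1}^{m_F} c_i b_i = C.$$

Then,

$$\psi_F(\theta_2) = \log \int_F \exp\{\langle \theta_1 + \zeta, x \rangle\}\, d\nu_F(x) = \log \int_F \exp\{\langle \theta_1, x \rangle\}\, d\nu_F(x) + C = \psi_F(\theta_1) + C.$$

As both $\psi_F(\theta_1)$ and $C$ are finite, it follows that $\psi_F(\theta_2) < \infty$ and, therefore, $\theta_2 \in \Theta_F$. It is now easy to conclude that $P_{\theta_1} = P_{\theta_2}$ because, almost everywhere $\nu_F$,

$$p_{\theta_2}(x) = \exp\{\langle \theta_1 + \zeta, x \rangle - \psi_F(\theta_2)\} = \exp\{\langle \theta_1, x \rangle + C - \psi_F(\theta_1) - C\} = p_{\theta_1}(x).$$

We now show that if $P_{\theta_1} = P_{\theta_2}$ and $\theta_1 \ne \theta_2$, then $\theta_1 - \theta_2 \in \mathrm{lin}(N_F)$. By the Radon-Nikodym theorem, this occurs if and only if

$$\langle x, \theta_1 - \theta_2 \rangle = \psi_F(\theta_1) - \psi_F(\theta_2) = D$$

for some constant $D$, almost everywhere $\nu_F$. As $\nu_F$ has support contained in $F$ and $F$ is defined by (6), the previous equality is equivalent to $\theta_1 - \theta_2 \in \mathrm{lin}(N_F)$, thus completing the proof of the Lemma.

As for (9), since $P$ is full-dimensional and, almost everywhere $\nu_F$, $A_F x = b_F$, we have, for any $\theta \in \Theta_F$,

$$0 = \mathrm{Var}_\theta(\langle a, X \rangle) = a^\top I_F(\theta)\, a$$

if and only if $a \in \mathrm{lin}(N_F)$. This implies that $\mathrm{rank}(I_F(\theta)) = \dim(\mathrm{lin}(N_F)^\perp) = \dim(F)$.

■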
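The first part of the argument, namely that shifting the natural parameter by an element of $\mathrm{lin}(N_F)$ changes $\psi_F$ by a constant and leaves the distribution on $F$ unchanged, can be checked numerically. The sketch below uses the edge-triangle model on $g = 4$ nodes and the face exposed by the direction $a = (2, -1)$; the model size, the face and the parameter values are assumptions chosen only to keep the example small.

```python
from collections import Counter
from itertools import combinations, product
from math import exp, log


def base_measure(g):
    """nu(t): number of graphs on g nodes with (edges, triangles) = t."""
    dyads = list(combinations(range(g), 2))
    triples = list(combinations(range(g), 3))
    nu = Counter()
    for bits in product((0, 1), repeat=len(dyads)):
        x = dict(zip(dyads, bits))
        nu[sum(bits), sum(x[i, j] * x[j, h] * x[i, h] for i, j, h in triples)] += 1
    return nu


def face_measure(a, nu):
    """Restriction of nu to the face of P exposed by a (maximizers of <a, t>)."""
    best = max(a[0] * t[0] + a[1] * t[1] for t in nu)
    return {t: n for t, n in nu.items() if a[0] * t[0] + a[1] * t[1] == best}


def face_density(theta, nu_F):
    """Density p_theta on the face, together with psi_F(theta)."""
    psi_F = log(sum(n * exp(theta[0] * t[0] + theta[1] * t[1])
                    for t, n in nu_F.items()))
    dens = {t: exp(theta[0] * t[0] + theta[1] * t[1] - psi_F) for t in nu_F}
    return dens, psi_F


nu = base_measure(4)
a = (2, -1)                         # outer normal of one edge of P when g = 4
nu_F = face_measure(a, nu)          # support points with 2 * t_1 - t_2 = 8
theta1 = (0.4, -1.2)                # arbitrary illustrative parameter
c = 0.7
theta2 = (theta1[0] + c * a[0], theta1[1] + c * a[1])   # theta1 + zeta, zeta = c * a

p1, psi1 = face_density(theta1, nu_F)
p2, psi2 = face_density(theta2, nu_F)
print(sorted(nu_F))
print(psi2 - psi1)                                  # the constant C = c * 8
print(max(abs(p1[t] - p2[t]) for t in nu_F))        # numerically zero
```

Here $\langle \zeta, t \rangle = 8c$ for every support point $t$ on the face, which plays the role of the constant $C$ in the proof.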

Proof of Lemma 2.3. Arguing by contradiction, suppose that, for all $n$ large enough, $\theta_n$ belongs to a compact, hence bounded, set $C$. The facts that $\nabla\psi(\theta) = E_\theta[X] \in \mathrm{relint}(P)$, for each $\theta \in \Theta$ with finite norm, and that $\mathrm{relint}(P)$ and $\Theta$ are homeomorphic, imply that $\{\nabla\psi(\theta) : \theta \in C\}$ is a compact subset of $\mathrm{relint}(P)$. Then, because $\|\nabla\psi(\theta) - \mu_F\|_2$ is a continuous function of $\theta$, for all $\theta \in \Theta$, $\inf_{\theta_n \in C} \|\nabla\psi(\theta_n) - \mu_F\|_2 = \|\nabla\psi(\theta^*) - \mu_F\|_2$ for some $\theta^* \in C$. But then, $\nabla\psi(\theta^*) = \mu^* \in \mathrm{relint}(P)$ so that $\|\mu^* - \mu_F\|_2 > \epsilon > 0$ for some $\epsilon$, which produces a contradiction. ■

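Proof of Theorem 2.4. Throughout the proof, we will write $S_{k-1} = \{x \in \mathbb{R}^k : \|x\|_2 = 1\}$.

In the proof we will make repeated use of the following decomposition. For any point $x_0 \in P$ and proper face $F$ of $P$, we will write

$$p_{\theta_n}(x_0) = \frac{\exp\{\langle \eta, x_0 \rangle\}}{A_{0,n}(x_0,F) + A_{>,n}(x_0,F) + A_{<,n}(x_0,F)}, \tag{17}$$

where

$$A_{0,n}(x_0,F) = \int_{\{x :\, A_F(x - x_0) = 0\}} \exp\{\langle \eta, x \rangle + \rho_n \langle d_n, x - x_0 \rangle\}\, d\nu(x),$$

$$A_{>,n}(x_0,F) = \int_{\{x :\, A_F(x - x_0) > 0\}} \exp\{\langle \eta, x \rangle + \rho_n \langle d_n, x - x_0 \rangle\}\, d\nu(x),$$

and

$$A_{<,n}(x_0,F) = \int_{\{x :\, A_F(x - x_0) < 0\}} \exp\{\langle \eta, x \rangle + \rho_n \langle d_n, x - x_0 \rangle\}\, d\nu(x).$$

Notice that, for all $n$, if $x_0 \in F$, then $A_{>,n}(x_0, F) = 0$, since $\nu\{x : A_F(x - x_0) > 0\} = 0$. We will also use the following fact, which stems directly from Lemma 2.2: $\exp\{\langle \eta, x \rangle - \psi_F(\eta)\} = \exp\{\langle \theta_F, x \rangle - \psi_F(\theta_F)\} = p_{\theta_F}(x)$, almost everywhere $\nu_F$.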

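1. Part 1.

We will begin by showing sufficiency. First, we consider the case of a generic point $x_0 \in F$. If $d_n \in \mathrm{ri}(N_F)$, then, by part 1. of Lemma 7.2, $\langle d_n, x - x_0 \rangle = 0$ for all $x \in F$, which implies that

$$A_{0,n}(x_0,F) = \int_F \exp\{\langle \eta, x \rangle\}\, d\nu(x) = \exp\{\psi_F(\eta)\}$$

for all $n$. On the other hand, for any $x \notin F$, since $R$ is a compact subset of $\mathrm{ri}(N_F) \cap S_{k-1}$ and $\{d_n\} \subset R$, we have

$$\sup_n \langle d_n, x - x_0 \rangle \le \sup_{d \in R} \langle d, x - x_0 \rangle = \langle d^*, x - x_0 \rangle,$$

for some $d^* \in R$, which may depend on $x$. Furthermore, by part 2. of Lemma 7.2, $\langle d^*, x - x_0 \rangle < 0$. Thus, $\rho_n \langle d_n, x - x_0 \rangle \to -\infty$, for each $x \notin F$. Moreover,

$$\exp\{\langle \eta, x \rangle + \rho_n \langle d_n, x - x_0 \rangle\} \le \exp\{\langle \eta, x \rangle\}$$

for each $x \in \{x : A_F(x - x_0) < 0\}$, whereby

$$\int_{\{x :\, A_F(x - x_0) < 0\}} \exp\{\langle \eta, x \rangle\}\, d\nu(x) \le \int \exp\{\langle \eta, x \rangle\}\, d\nu(x) = \exp\{\psi(\eta)\} < \infty.$$

Then, by the dominated convergence theorem, we obtain

$$A_{<,n}(x_0,F) \searrow 0.$$

Therefore,

$$A_{0,n}(x_0,F) + A_{<,n}(x_0,F) \searrow \exp\{\psi_F(\eta)\},$$

which implies that

$$p_{\theta_n}(x_0) \nearrow \exp\{\langle \eta, x_0 \rangle - \psi_F(\eta)\} = p_{\theta_F}(x_0). \tag{18}$$

Next, let $x_0 \in P \cap F^c$ and notice that

$$A_{>,n}(x_0, F) + A_{0,n}(x_0, F) + A_{<,n}(x_0, F) \ge A_{>,n}(x_0, F) \ge \int_F \exp\{\langle \eta, x \rangle + \rho_n \langle d_n, x - x_0 \rangle\}\, d\nu(x),$$

since $F \subset \{x : A_F(x - x_0) > 0\}$. For any $x \in F$, since $\{d_n\} \subset R$ and $R$ is a compact subset of $\mathrm{ri}(N_F) \cap S_{k-1}$, we get

$$\inf_n \langle d_n, x - x_0 \rangle \ge \inf_{d \in R} \langle d, x - x_0 \rangle = \langle d^*, x - x_0 \rangle,$$

for some $d^* \in R$, which may depend on $x$. By Lemma 7.2, part 2., $\rho_n \langle d^*, x - x_0 \rangle \nearrow \infty$, for all $x \in F$. But then, as $\nu(F) > 0$ by assumption (A3), we obtain

$$\int_F \exp\{\langle \eta, x \rangle + \rho_n \langle d_n, x - x_0 \rangle\}\, d\nu(x) \nearrow \infty$$

by the monotone convergence theorem. Thus,

$$A_{>,n}(x_0, F) + A_{0,n}(x_0, F) + A_{<,n}(x_0, F) \nearrow \infty, \tag{19}$$

and, therefore, $p_{\theta_n}(x_0) \to 0 = p_{\theta_F}(x_0)$.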
2. Part 2.

Suppose that $\{d_n\} \subset R$, where $R$ is a compact subset of $N_F$. Then, there exists a subsequence $\{d_{n_k}\} \subset \{d_n\}$ such that, for all $k$ large enough, $d_{n_k}$ belongs to a compact set $R^*$ such that either $R^* \subset \mathrm{ri}(N_{F'})$, for some $F' \ne F$, or $R^* \subset (\mathcal{N}(P))^c$. In the latter case, by part 3., proven below, the numbers $\|\mu_{n_k} - \mu_F\|_2$ grow unbounded and, therefore, (11) is violated.

In the former case, by part 1. of the proof, (11) is verified for $F'$, so it cannot be simultaneously verified for $F$ as well. Indeed, $p_{\theta_n}$ cannot converge pointwise to both $p_{\theta_F}$ and $p_{\theta_{F'}}$, which identify different probability distributions with different supports.

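3. Part 3.

We will show that, if $\{d_n\} \subset R$ for some compact subset $R$ of $(\mathcal{N}(P))^c$, then

$$p_{\theta_n}(x_0) \to 0, \quad \forall x_0 \in P. \tag{20}$$

This implies that $\|\mu_n\|_2 \to \infty$. Let $x_0 \in P$. As $P$ is full-dimensional and $d_n \notin \mathcal{N}(P)$, by Lemma 7.2, part 3., the set $S_n = \{x \in P : \langle d_n, x - x_0 \rangle > 0\}$ is non-empty, for each $n$. Furthermore, since, by assumption,

$$\inf_{d \in R} \inf_{d' \in \mathcal{N}(P)} \|d - d'\|_2 > 0,$$

the set $S = \liminf_n S_n$ is non-empty as well. We now claim that $\nu(S) > 0$. In fact, arguing by contradiction, suppose that $\nu(S) = 0$. Then, there exists a subsequence $\{d_{n_k}\} \subset \{d_n\}$ such that no point $x \in \mathrm{supp}(\nu)$ can satisfy $\lim_k \langle d_{n_k}, x - x_0 \rangle > 0$. However, since, by assumption (A3), $P = \mathrm{convhull}(\mathrm{supp}(\nu))$, this implies that $P \subset \{y : \lim_k \langle d_{n_k}, y - x_0 \rangle \le 0\}$, which in turn implies that $\lim_k d_{n_k} \in \mathcal{N}(P)$, violating the condition that $\{d_n\}$ is bounded away from $\mathcal{N}(P)$. Thus, $\nu(S) > 0$, from which we can conclude that $\liminf_n \nu(S_n) \ge \nu(S) > 0$. Then, by the monotone convergence theorem,

$$\int_S \exp\{\langle \eta, x \rangle + \rho_n \langle d_n, x - x_0 \rangle\}\, d\nu(x) \nearrow \infty.$$

Therefore,

$$p_{\theta_n}(x_0) \le \frac{\exp\{\langle \eta, x_0 \rangle\}}{\int_S \exp\{\langle \eta, x \rangle + \rho_n \langle d_n, x - x_0 \rangle\}\, d\nu(x)} \to 0,$$

as claimed.

■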

Proof of Corollary 2.6. Any direction $d \in \mathbb{R}^k$ is either in $\mathcal{N}(P)$, in which case it must belong to $\mathrm{ri}(N_F)$ for one face $F$ of $P$, or in $(\mathcal{N}(P))^c$. The results then follow directly from Theorem 2.4. ■

Proof of Corollary 2.8. If $x \in \mathrm{ri}(P)$, then the MLE exists, is unique and is given by the vector $\theta \in \Theta$ such that $\nabla\psi(\theta) = x$. Equivalently, since in this case $N_P = \{0\}$, invoking Corollary 2.6, part 1., $-\ell_x$ has no direction of recession. Thus consider the case of $x \in \mathrm{rb}(P)$ and let $F$ be the unique face such that $x \in \mathrm{ri}(F)$. If $d \in \mathrm{ri}(N_F)$, then by Corollary 2.6 part 1.,

$$\lim_{\rho \to \infty} p_{\theta + \rho d}(x) > 0, \tag{21}$$

so (13) holds. Suppose now that $d \in \mathrm{rb}(N_F)$. Let $\mathcal{F}_x = \{F' : x \in \mathrm{rb}(F')\}$, with $F'$ being a face of $P$. By Lemma 7.2, part 4.,

$$\biguplus_{F' \in \mathcal{F}_x} \mathrm{ri}(N_{F'}) = \mathrm{rb}(N_F),$$

so, if $d \in \mathrm{rb}(N_F)$, then $d \in \mathrm{ri}(N_{F'})$, for some $F' \in \mathcal{F}_x$. By Corollary 2.6, part 1., almost everywhere $\nu_{F'}$,

$$\lim_{\rho \to \infty} p_{\theta + \rho d} = p_{\theta_{F'}}.$$

Since $x \in F'$, we have $\nu_{F'}(x) > 0$, which implies $p_{\theta_{F'}}(x) > 0$ and, consequently, (21). Thus $d$ is also a direction of recession and we have shown that any point in $N_F$ is a direction of recession for $-\ell_x$.

It remains to be shown that Equation (13) is not verified if $d \notin N_F$. If $d \notin \mathcal{N}(P)$, Corollary 2.6 part 2. yields

$$\lim_{\rho \to \infty} p_{\theta + \rho d}(x) = 0,$$

hence $-\ell_x(\theta + \rho d) \to \infty$, so $d$ is not a direction of recession. If instead $d \in \mathcal{N}(P) \cap N_F^c$, then it must be the case that $d \in \mathrm{ri}(N_{F^*})$, for some face $F^*$ such that $F \cap F^* = \emptyset$, otherwise $N_{F^*} \subset \mathrm{rb}(N_F)$ (see, e.g., Lemma 7.2, part 4.). Thus, by Corollary 2.6, part 1.,

$$\lim_{\rho \to \infty} p_{\theta + \rho d}(x) = p_{\theta_{F^*}}(x) = 0,$$

because $x \notin F^*$, while $p_{\theta_{F^*}}(x) > 0$ only if $x \in F^*$. As a result, (21) does not hold, so that $d$ does not satisfy (13) and is not a direction of recession.

■
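In the two-dimensional running example, the dichotomy at the start of this proof (the MLE exists when the observed statistic lies in the interior of $P$, and fails to exist when it lies on the boundary) reduces to a point-in-polygon test. The sketch below implements that test for the vertices of $P$ listed in Section 3.4; the function name `mle_exists` and the sample points are assumptions for illustration, and the test is meant for points already known to belong to $P$.

```python
# Vertices of the convex support P from the text, in counter-clockwise order.
VERTICES = [(0, 0), (20, 0), (27, 27), (30, 44), (32, 56), (36, 84)]


def mle_exists(t, vertices=VERTICES):
    """True if t is in the interior of P, i.e. strictly left of every CCW edge."""
    n = len(vertices)
    for i in range(n):
        (ax, ay), (bx, by) = vertices[i], vertices[(i + 1) % n]
        # Cross product of the edge direction with the vector from a to t.
        if (bx - ax) * (t[1] - ay) - (by - ay) * (t[0] - ax) <= 0:
            return False
    return True


print(mle_exists((25, 30)))   # hypothetical point in the interior of P
print(mle_exists((20, 0)))    # a vertex of P
print(mle_exists((10, 0)))    # a point on the lower edge of P
```

When the function returns `False` for an observed statistic in $P$, the statistic lies in the relative interior of some proper face $F$, and the directions in $N_F$ are the directions of recession discussed in the proof.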

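Proof of Corollary 2.9. The only interesting case is when $x \in \mathrm{ri}(F)$, for some proper face $F$, otherwise $N_P = \{0\}$, and $-\ell_x$ has no directions of recession, as the MLE exists. For every $\theta \in \Theta$, let $\{\theta_n\}$ be a $(\theta, \{\rho_n\}, d)$-sequence. By Corollary 2.8, we need to consider only the case $d \in N_F$. If $d \in \mathrm{ri}(N_F)$, by Lemma 7.1, $p_{\theta_n}(x) \nearrow p_{\theta_F}(x)$. Now suppose that $d \in \mathrm{rb}(N_F)$. Then, $d \in \mathrm{ri}(N_{F^*})$ for some face $F^*$ such that $F \subset F^*$. Another application of Lemma 7.1 yields $p_{\theta_n}(x) \nearrow p_{\theta_{F^*}}(x)$. However,

$$p_{\theta_{F^*}}(x) = \exp\{\langle \theta, x \rangle - \psi_{F^*}(\theta)\} < \exp\{\langle \theta, x \rangle - \psi_F(\theta)\} = p_{\theta_F}(x),$$

since

$$\exp\{\psi_{F^*}(\theta)\} = \int_{F^*} \exp\{\langle \theta, z \rangle\}\, d\nu(z) > \int_F \exp\{\langle \theta, z \rangle\}\, d\nu(z) = \exp\{\psi_F(\theta)\}.$$

Thus, $\sup_{\theta \in \Theta} p_\theta(x) = p_{\gamma_F}(x)$ for some $\gamma \in \Theta_F$. But the supremum of $p_{\theta_F}(x)$ over $\theta \in \Theta_F$ is attained, since only the points $\theta \in \Theta_F$ satisfy the first order optimality conditions (3). The result follows. ■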

Proof of Corollary 2.10. Parts i) and ii) follow from Lemma 2.2 and results of Csiszar and Matus (2003, 2005). Part iii) is a direct consequence of part i). ■

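Proof of Corollary 2.11. For any $\theta \in \Theta$, the $(i, j)$-th entry of $I(\theta)$ is (see, e.g., Corollary 2.3 in Brown, 1986)

$$\frac{\partial^2 \psi(\theta)}{\partial \theta_i\, \partial \theta_j}.$$

From the proof of Theorem 2.4, $\psi(\theta_n) \to \psi_F(\theta + \zeta)$, for every $\zeta \in \mathrm{lin}(N_F)$. Then, by the analytic properties of the cumulant generating function (see, e.g. Brown, 1986, Chapter 2), we obtain

$$\frac{\partial^2 \psi(\theta_n)}{\partial \theta_i\, \partial \theta_j} \to \frac{\partial^2 \psi_F(\theta + \zeta)}{\partial \theta_i\, \partial \theta_j}$$

for every $\zeta \in \mathrm{lin}(N_F)$, hence the statement is proved. ■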
7 Appendix B

The following lemma is needed in the proof of Corollary 2.9.

Lemma 7.1. Under the conditions of Corollary 2.6, $p_{\theta_n} \nearrow p^{F}_{\theta}$, a.e. $\nu_F$, if and only if $d \in \mathrm{relint}(N_F)$.

Proof. The claim follows from Equation (18) in the proof of Theorem 2.4, which holds for all $x \in F$, thus almost everywhere $\nu_F$. ■

Below, we collect some basic facts about the normal fan and normal cones needed in our proofs. With some slight abuse of notation, we say that a vector $d$ is normal to the hyperplane $H$ if $\langle d, x - y \rangle = 0$ for all $x, y \in H$.

Lemma 7.2. Let $P$ be full-dimensional and let $F$ be a face of $P$.

1. For any $x_0 \in F$, $\langle a, x - x_0 \rangle = 0$ for all $x \in F$ and $\langle a, x - x_0 \rangle < 0$ for all $x \notin F$ if and only if $a \in \mathrm{relint}(N_F)$.

2. For any $x_0 \notin F$, $\langle a, x - x_0 \rangle > 0$ for all $x \in F$ and $\langle a, x - x_0 \rangle < 0$ for all $x \notin F$ if and only if $a \in \mathrm{relint}(N_F)$.

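3. If $d \notin \mathcal{N}(P)$, then, for any $x_0 \in P$,

\[ P = S_{>,x_0} \uplus S_{=,x_0} \uplus S_{<,x_0}, \]

where $S_{>,x_0}$, $S_{=,x_0}$ and $S_{<,x_0}$ are disjoint, non-empty sets given by $\{x \in P : \langle d, x - x_0 \rangle > 0\}$, $\{x \in P : \langle d, x - x_0 \rangle = 0\}$ and $\{x \in P : \langle d, x - x_0 \rangle < 0\}$, respectively.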

4. $\mathrm{rb}(N_F) = \biguplus_{F' \supsetneq F} \mathrm{ri}(N_{F'})$, where the disjoint union ranges over all the faces $F'$ of $P$ that properly contain $F$.

5. $N_F = \mathrm{cone}(a_1, \ldots, a_{m_F})$, where $a_i$ denotes the transpose of the $i$-th row of the submatrix $A_F$ given in (6), $i = 1, \ldots, m_F$.

Proof. Recall that, since $P$ is full-dimensional, there is no vector $d \neq 0$ such that $\langle d, x - x_0 \rangle = 0$ for all pairs $x, x_0 \in P$.

1. First we show sufficiency. If $a \in \mathrm{relint}(N_F)$, then $a$ is a conic combination of all the rows of $A_F$ with positive coefficients. Therefore, $\langle a, x - x_0 \rangle = 0$ for all $x \in F$, by the definition of $F$, and $\langle a, x - x_0 \rangle < 0$ for all $x \notin F$, since, in this case, $\langle a_i, x - x_0 \rangle < 0$ for some row $a_i$ of $A_F$. As for necessity, if $a \in N_F$, then $\langle a, x - x_0 \rangle < 0$ for all $x \in \mathrm{ri}(P)$. However, if $a \in \mathrm{rb}(N_F)$, then $\langle a, x - x_0 \rangle = 0$ for all $x \in F'$, where $F'$ is the face of $P$ such that $a \in \mathrm{ri}(N_{F'})$. But then, since $F \subset F'$, there exists an $x \notin F$ for which $\langle a, x - x_0 \rangle = 0$, which would produce a contradiction. Thus $a \notin \mathrm{rb}(N_F)$.

2. The proof is analogous to the previous case and is omitted. 

3. Since $d$ is not normal to any supporting hyperplane, the hyperplane $H = \{x : \langle d, x - x_0 \rangle = 0\}$ intersects $P$ in its relative interior, and $P$ must have non-empty intersections with both the halfspaces $\{x \in \mathbb{R}^k : \langle d, x - x_0 \rangle > 0\}$ and $\{x \in \mathbb{R}^k : \langle d, x - x_0 \rangle < 0\}$ cut out by $H$.

4. The claim follows directly from the definition of $N_F$ and the fact that $\mathcal{N}(P)$ is a polyhedral complex (see, e.g., Sturmfels, 1995), thus the relative boundary of $N_F$ is the disjoint union of the relative interiors of all its proper faces.

5. Let $c \in \mathrm{cone}(a_1, \ldots, a_{m_F})$, so that $c = A_F^\top \lambda$, where $\lambda \in \mathbb{R}^{m_F}$ has nonnegative coordinates. Then, for all $x \in F$ and $y \in P \cap F^c$,

\[ \langle c, x \rangle = \langle \lambda, A_F x \rangle = \langle \lambda, b_F \rangle \geq \langle \lambda, A_F y \rangle = \langle A_F^\top \lambda, y \rangle = \langle c, y \rangle, \]

since $A_F x = b_F$ and $A_F y \leq b_F$. Thus $c \in N_F$, and we have shown that $\mathrm{cone}(a_1, \ldots, a_{m_F}) \subseteq N_F$. Conversely, assume that $c$ is a nonzero vector in $N_F$ but $c \notin \mathrm{cone}(a_1, \ldots, a_{m_F})$. Then $c$ is not normal to any supporting hyperplane of $F$, which implies that there exist $x \in F$ and $y \in P \cap F^c$ such that $\langle c, x - y \rangle < 0$, producing a contradiction. Thus it must be the case that $c \in \mathrm{cone}(a_1, \ldots, a_{m_F})$ as well, yielding $N_F \subseteq \mathrm{cone}(a_1, \ldots, a_{m_F})$.



8 Appendix C: Checking for the existence of the MLE via Linear Programming

Deciding whether the MLE exists, that is, whether the vector of observed sufficient statistics $x$ satisfies $x \in \mathrm{ri}(P)$, is particularly simple if one has access to an H-representation of $P$ as in (5), as indicated in the next result, which is immediate to verify.

Lemma 8.1. The MLE exists if and only if the system $Ax \leq b$ is satisfied with strict inequalities.
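As a minimal sketch of the check in Lemma 8.1, assume the H-representation is available as a list of rows `A` with right-hand sides `b`. The function name, the tolerance `tol` and the unit-square example are illustrative assumptions and are not part of the software described in this paper.

```python
def mle_exists_h_rep(A, b, x, tol=1e-12):
    """Return True when every inequality <a_i, x> <= b_i holds strictly.

    A   : list of rows (each a list of floats) of the H-representation
    b   : list of right-hand sides, one per row of A
    x   : vector of observed sufficient statistics
    tol : hypothetical numerical slack separating "strict" from "tight"
    """
    for row, rhs in zip(A, b):
        lhs = sum(a_j * x_j for a_j, x_j in zip(row, x))
        if lhs >= rhs - tol:  # tight or violated: x is not an interior point
            return False
    return True


# Illustrative polytope: the unit square [0, 1]^2 written as Ax <= b.
A_SQUARE = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
B_SQUARE = [1.0, 0.0, 1.0, 0.0]

interior_point = mle_exists_h_rep(A_SQUARE, B_SQUARE, [0.5, 0.25])
boundary_point = mle_exists_h_rep(A_SQUARE, B_SQUARE, [1.0, 0.25])
```

In this toy example the first point lies strictly inside the square, so the check succeeds, while the second lies on the facet $x_1 = 1$ and the check fails.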

Unfortunately, this type of representation is typically not available or is prohibitively hard to compute, even when $k$ is small, since the number of faces of $P$ may grow super-exponentially in $k$ (see, for example, Ziegler, 2001).

If instead only a V-representation (7) is available or computable, the existence of the MLE can be established using linear programming, as outlined below. Let $B$ be a matrix whose columns contain the vertices and extreme rays of $P$, namely the vectors in $Q$ and $C$ from Equation (7). Then $x \in \mathrm{ri}(P)$ if and only if $x$ can be obtained as a linear combination of the vectors in $Q$ and $C$ with strictly positive coefficients.

Lemma 8.2. The MLE exists if and only if x = Bz, for a vector z with strictly positive coordinates. 

This is a feasibility problem which can be decided by solving the linear program 

\[
\begin{array}{ll}
\max & s \\
\text{s.t.} & Bz = x \\
 & z_i - s \geq 0 \\
 & s \geq 0,
\end{array}
\]

where $z_i$ denotes the $i$-th coordinate of $z$ and $s$ is a scalar. If $(s^*, z^*)$ is the optimum, then the MLE exists if and only if $s^* > 0$.
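As a small illustration of this linear program, consider a hypothetical one-dimensional example that is not taken from the paper: $P = [0, 1]$, with vertices $0$ and $1$ and no extreme rays. The example assumes that the requirement that the vertex weights sum to one is encoded by appending a row of ones to $B$ and a final coordinate equal to $1$ to $x$.

```latex
B = \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix}, \qquad
Bz = x \iff z_2 = x_1, \quad z_1 + z_2 = 1 .

% Interior observation: x = (1/2, 1)^T
z^* = (1/2,\, 1/2)^\top, \qquad s^* = \min_i z_i^* = 1/2 > 0
\quad \Longrightarrow \quad \text{the MLE exists.}

% Boundary observation: x = (0, 1)^T
z^* = (1,\, 0)^\top, \qquad s^* = \min_i z_i^* = 0
\quad \Longrightarrow \quad \text{the MLE does not exist.}
```

In both cases the equality constraints determine $z$ uniquely, so the optimal $s^*$ is simply the smallest coordinate of $z$.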

An alternative linear program, which may be computationally preferable, can be formulated based on 
Theorem 8.3, whose proof can be found in Schrijver (1998), as follows: 

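\[
\begin{array}{ll}
\max & \langle 1, y \rangle \\
\text{s.t.} & B^\top y = 0 \\
 & y \geq 0 \\
 & y \leq 1.
\end{array}
\]

If $y^*$ is the optimum, the MLE does not exist if and only if $\langle 1, y^* \rangle > 0$.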

Theorem 8.3 (Gordan's Theorem of Alternatives). Given a matrix B, the following are alternatives: 

1. $Bx > 0$ has a solution $x$.

2. $B^\top y = 0$, $y \geq 0$, $y \neq 0$, has a solution $y$.
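A two-row illustration, chosen here only to make the alternatives concrete and not taken from the paper, may help. It reads alternative 2 as requiring a nonzero $y$ with nonnegative coordinates.

```latex
% Case 1: alternative 1 holds, alternative 2 fails.
B_1 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}: \qquad
B_1 x > 0 \text{ is solved by } x = 1, \qquad
B_1^\top y = y_1 + y_2 = 0,\ y \geq 0 \ \Longrightarrow\ y = 0 .

% Case 2: alternative 1 fails, alternative 2 holds.
B_2 = \begin{pmatrix} 1 \\ -1 \end{pmatrix}: \qquad
B_2 x > 0 \iff x > 0 \text{ and } -x > 0 \ \text{(impossible)}, \qquad
B_2^\top y = y_1 - y_2 = 0 \text{ is solved by } y = (1,\, 1)^\top .
```

In each case exactly one of the two systems is solvable, which is what the linear program based on this theorem exploits.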
9 Appendix D: Software



The code used for the analysis and for the figures of the paper is available on the web at 

http://www.stat.cmu.edu/~arinaldo/ERG/

The software includes: 

1. the MATLAB GUI used for creating Figures 8, 9 and 10 and some short movies showing the relationship 
between sequences of natural parameters moving along the outer normals of P and the corresponding 
sequences of mean values; 

2. an MPI C++ program for complete enumeration of all undirected graphs on $n$ nodes and for counting the number of edges, triangles, $k$-stars and alternating $k$-stars. However, complete enumeration is feasible only for very small graphs. Using our program, which can certainly be improved, it took about 1 hour on a 64-node cluster to enumerate all graphs on 9 nodes, but for 10-node graphs the estimated running time is about 26.5 days.
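The enumeration idea can be sketched in a few lines of Python. This is a serial, brute-force illustration and not the MPI C++ program itself: it tallies only edges and triangles and omits $k$-stars and alternating $k$-stars. The function names and the bit-mask encoding of graphs are assumptions made for this example.

```python
from itertools import combinations


def enumerate_graph_statistics(n):
    """Tally (edges, triangles) over all undirected simple graphs on n nodes.

    Returns a dict mapping (number of edges, number of triangles) to the
    number of labeled graphs with those counts. Cost: 2**C(n, 2) graphs.
    """
    pairs = list(combinations(range(n), 2))    # all possible edges
    triples = list(combinations(range(n), 3))  # all possible triangles
    index = {pair: i for i, pair in enumerate(pairs)}
    # Bit mask of the three edges that make up each potential triangle.
    triangle_masks = [
        (1 << index[(a, b)]) | (1 << index[(a, c)]) | (1 << index[(b, c)])
        for a, b, c in triples
    ]
    counts = {}
    for graph in range(1 << len(pairs)):       # each integer encodes one graph
        edges = bin(graph).count("1")
        triangles = sum(1 for m in triangle_masks if (graph & m) == m)
        key = (edges, triangles)
        counts[key] = counts.get(key, 0) + 1
    return counts


def number_of_graphs(n):
    """Number of labeled undirected simple graphs on n nodes."""
    return 2 ** (n * (n - 1) // 2)


stats_4 = enumerate_graph_statistics(4)
growth_9_to_10 = number_of_graphs(10) // number_of_graphs(9)
```

Going from 9 to 10 nodes multiplies the number of graphs by $2^9 = 512$, so a run of about 1 hour on 9 nodes scales to roughly 21 days from the graph count alone, which is of the same order as the 26.5-day estimate quoted above.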